Deploying TKG v1.2.0 (TKGm) in an internet-restricted environment using Harbor

In this post, I am going to outline the steps involved in successfully deploying a Tanzu Kubernetes Grid (TKG) management cluster and workload clusters in an internet-restricted environment. [Note: since first writing this article, we appear to have standardized on TKGm – TKG multi-cloud – as the name for this product.] This is often referred to as an air-gapped environment. Note that for part of this exercise, a virtual machine will need to be connected to the internet in order to pull down the images required for TKG. Once these have been downloaded and pushed up to our local Harbor container image registry, the internet connection can be removed and we will work in a completely air-gapped environment.

Note that TKG here refers to the TKG distribution that deploys a management cluster and provides a tkg CLI to deploy TKG “workload” clusters. As mentioned, this is now being marketed as TKGm. This is different to the TKG found in vSphere with Tanzu (and VCF with Tanzu). vSphere with Tanzu provides many unique advanced features such as vSphere Namespaces, vSphere SSO integration, TKGs – the TKG service – for deploying TKG “guest” clusters, the vSphere Network Service, the vSphere Storage Service and, if you have NSX-T, the vSphere Pods Service and vSphere Registry Service. However, for the purposes of this post, we are working with the former.

Prerequisites

I am using Ubuntu 18.04 as the Guest OS to deploy Harbor as well as to run my tkg CLI commands. You could use another distro, but some commands shown here are specific to Ubuntu. I already have both docker (v19.03.13) and docker-compose (v1.27.4) installed. I’m not going to cover the installation of these – there are plenty of examples already available. One thing that might be useful is to avoid putting sudo in front of every docker command; information on how to add your user as a trusted docker user can be found here.
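
For reference, one common way to do this (assuming the default docker group created by the Docker packages) is to add your account to that group and start a new login session:

$ sudo usermod -aG docker $USER
$ newgrp docker     # or simply log out and back in
$ docker ps         # should now work without sudo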

We also need to have the TKG binaries downloaded. Note that this exercise uses TKG v1.2.0, which allows us to use a self-signed CA certificate. This is useful in home labs and non-production environments, but certainly not advisable in production. The binaries are available here: https://www.vmware.com/go/get-tkg. You will also need to have the appropriate OVA template available in your vSphere environment for the TKG images. This image will be used to build the control plane and worker nodes.

TKG requires DHCP to be available in your internet-restricted environment. I used dnsmasq to provide both DNS and DHCP, which is again very useful in non-production, home-lab type environments. I used guidance from both linuxhints and computing for geeks to configure dnsmasq. Ensure both services are working correctly by using the appropriate tools (e.g. nslookup, dig) before proceeding with the TKG deployment.
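
For illustration, a minimal dnsmasq configuration for this kind of setup might look something like the sketch below. The interface name, gateway, DNS server and DHCP range are hypothetical values for my 10.35.13.0/24 network – adjust them for your own environment, and keep any static addresses (such as the control plane endpoint IPs used later) outside the DHCP range.

# /etc/dnsmasq.conf (sketch – hypothetical addresses)
interface=ens160
domain=corinternal.com
# upstream DNS server for anything dnsmasq cannot resolve locally
server=10.35.13.1
# DHCP range handed out to the TKG node VMs
dhcp-range=10.35.13.50,10.35.13.200,255.255.255.0,12h
dhcp-option=option:router,10.35.13.1
dhcp-option=option:dns-server,10.35.13.10
# static DNS entry for the local Harbor registry
address=/cormac-tkgm.corinternal.com/10.35.13.10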

Harbor, the VMware Container Image Registry, should already be installed and running; I posted a blog earlier on how to do this. It is extremely important that all the appropriate Docker and Harbor certificates and keys are in place – this is also covered in the Harbor post. For this new post, I’ve redeployed Harbor once again. Here is its status, along with a simple test to show that we can log in using both http and https, as well as push local images to it.

$ sudo docker ps -a
CONTAINER ID        IMAGE                                COMMAND                  CREATED              STATUS                        PORTS                                         NAMES
b1da486c7e16        goharbor/nginx-photon:v2.1.0         "nginx -g 'daemon of…"   About a minute ago   Up About a minute (healthy)   0.0.0.0:80->8080/tcp, 0.0.0.0:443->8443/tcp   nginx
ff6d0b00ed66        goharbor/harbor-jobservice:v2.1.0    "/harbor/entrypoint.…"   About a minute ago   Up About a minute (healthy)                                                 harbor-jobservice
1d2522c10110        goharbor/harbor-core:v2.1.0          "/harbor/entrypoint.…"   About a minute ago   Up About a minute (healthy)                                                 harbor-core
32e2e1246de3        goharbor/harbor-db:v2.1.0            "/docker-entrypoint.…"   About a minute ago   Up About a minute (healthy)                                                 harbor-db
11960021374d        goharbor/redis-photon:v2.1.0         "redis-server /etc/r…"   About a minute ago   Up About a minute (healthy)                                                 redis
b14fd57ed92b        goharbor/harbor-portal:v2.1.0        "nginx -g 'daemon of…"   About a minute ago   Up About a minute (healthy)                                                 harbor-portal
89f947fe7077        goharbor/harbor-registryctl:v2.1.0   "/home/harbor/start.…"   About a minute ago   Up About a minute (healthy)                                                 registryctl
8d44c29fc175        goharbor/registry-photon:v2.1.0      "/home/harbor/entryp…"   About a minute ago   Up About a minute (healthy)                                                 registry
873df21d0338        goharbor/harbor-log:v2.1.0           "/bin/sh -c /usr/loc…"   About a minute ago   Up About a minute (healthy)   127.0.0.1:1514->10514/tcp                     harbor-log


$ sudo docker login cormac-mgmt.corinternal.com
Username: admin
Password: ******
WARNING! Your password will be stored unencrypted in /home/cormac/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded

$ sudo docker login https://cormac-mgmt.corinternal.com
Authenticating with existing credentials...
WARNING! Your password will be stored unencrypted in /home/cormac/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded

$ sudo docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
0e03bdcc26d7: Pull complete
Digest: sha256:e7c70bb24b462baa86c102610182e3efcb12a04854e8c582838d92970a09f323
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/

For more examples and ideas, visit:
https://docs.docker.com/get-started/

$ sudo docker tag hello-world:latest cormac-mgmt.corinternal.com/library/hello-world:latest

$ sudo docker push cormac-mgmt.corinternal.com/library/hello-world
The push refers to repository [cormac-mgmt.corinternal.com/library/hello-world]
9c27e219663c: Pushed
latest: digest: sha256:90659bf80b44ce6be8234e6ff90a1ac34acbeb826903b02cfa0da11c82cbc042 size: 525

Note that when I created the certificates for Harbor in my previous post, I used my own internal CA. If you want to establish trust between TKG and Harbor, we need to either (a) have the TKG nodes trust your own internal CA, or (b) bypass the certificate validation checks. For (a), my colleague Chip Zoller describes how to inject your own CA cert into the TKG nodes in his blog post here. Since this is only a lab environment, and not a production environment, I’m going to implement option (b) and set an environment variable to bypass the certificate validation, as we will see shortly. Again, for production environments, use a certificate signed by a valid CA, or follow Chip’s guidance above.

Continuing with the prerequisites, we also need kubectl installed. Using the snap feature was the easiest way I found to install it on Ubuntu 18.04.

$ sudo snap install kubectl --classic
2020-11-30T11:00:40Z INFO Waiting for automatic snapd restart...
kubectl 1.19.4 from Canonical✓ installed

Finally, you will need the tools yq and jq installed. These are required by a script (seen later) which parses the TKG BOMs for the required container images. This in turn creates a new script which pulls the original images from the TKG registry (registry.tkg.vmware.run) and pushes them up to our local Harbor image registry so that TKG can access them. We will see how this is done shortly.

You will need to pull down yq as follows – I used version 3.4.1.

$ sudo wget https://github.com/mikefarah/yq/releases/download/3.4.1/yq_linux_amd64 -O /usr/bin/yq
$ sudo chmod +x /usr/bin/yq

On Ubuntu, jq can be installed as follows:

$ sudo apt-get install jq
$ sudo chmod +x /usr/bin/jq
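
A quick check that both tools are on the PATH and executable:

$ yq --version
$ jq --version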

All of the prerequisites are now in place – we can start with pulling the images from the public TKG registry and pushing them up to our private Harbor registry.

Pushing TKG images to local Harbor registry

This procedure requires the script found in the official Tanzu documentation here. The gen-publish-images.sh script traverses all of the BOM manifest files found in your local .tkg/bom folder and creates a new script from the output. This resulting script then pulls the images from the public TKG registry and pushes them to your local Harbor registry. As I mentioned earlier, I am not going to inject my own CA cert into my TKG nodes. Instead, I’m going to bypass this check with the use of an environment variable that relaxes certificate verification. For production environments, you would not want to take this approach, and instead use a secure method.

Before running the script, set the following variables:

TKG_CUSTOM_IMAGE_REPOSITORY_SKIP_TLS_VERIFY=true
TKG_CUSTOM_IMAGE_REPOSITORY=cormac-tkgm.corinternal.com/library

To ensure that these variables stay set, I added them to my $HOME/.bash_profile and then sourced the .bash_profile. Alternatively, log out and log back in again to ensure the variables are set correctly.
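
One way to do this is to append the exports to your profile and then source it – a minimal sketch:

cat >> $HOME/.bash_profile <<'EOF'
export TKG_CUSTOM_IMAGE_REPOSITORY_SKIP_TLS_VERIFY=true
export TKG_CUSTOM_IMAGE_REPOSITORY=cormac-tkgm.corinternal.com/library
EOF
source $HOME/.bash_profile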

$ env | grep TKG
TKG_CUSTOM_IMAGE_REPOSITORY_SKIP_TLS_VERIFY=true
TKG_CUSTOM_IMAGE_REPOSITORY=cormac-tkgm.corinternal.com/library

To create the .tkg folder and sub-folders in your $HOME, run the following command.

$ tkg get management-cluster

Finally, before we run the gen-publish-images.sh script, we can reduce the amount of time taken to pull and push the TKG images by removing a number of the BOM manifests. You should do this only if you are certain you are never going to use some of the older Kubernetes versions available. In my case, I created a sub-folder in .tkg called oldbom and moved a number of the older manifests into it, reducing the number of manifests from nine to three. This significantly reduces the number of images that require pulling and pushing.

$ cd .tkg/bom
$ ls
bom-1.17.11+vmware.1.yaml bom-1.18.8+vmware.1.yaml bom-1.2.0+vmware.1.yaml
$ ls ../oldbom/
bom-1.1.0+vmware.1.yaml bom-1.1.2+vmware.1.yaml bom-1.1.3+vmware.1.yaml \
bom-1.17.6+vmware.1.yaml bom-1.17.9+vmware.1.yaml bom-tkg-1.0.0.yaml
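
For completeness, the move that produced the layout above amounts to something like this (filenames as shown in the oldbom listing):

$ mkdir -p ~/.tkg/oldbom
$ cd ~/.tkg/bom
$ mv bom-1.1.*.yaml bom-1.17.6+vmware.1.yaml bom-1.17.9+vmware.1.yaml bom-tkg-1.0.0.yaml ../oldbom/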

We can now run the gen-publish-images.sh script to list all of the images and generate the pull, tag and push commands. The output is redirected to another script, which I have called publish-images.sh. We then run the latter script to do the actual pulling and pushing.

$ ./gen-publish-images.sh > publish-images.sh

$ chmod +x publish-images.sh

$ more publish-images.sh
docker pull registry.tkg.vmware.run/prometheus/alertmanager:v0.20.0_vmware.1
docker tag  registry.tkg.vmware.run/prometheus/alertmanager:v0.20.0_vmware.1 cormac-tkgm.corinternal.com/library/prometheus/alertmanager:v0.20.0_vmware.1
docker push cormac-tkgm.corinternal.com/library/prometheus/alertmanager:v0.20.0_vmware.1

docker pull registry.tkg.vmware.run/antrea/antrea-debian:v0.9.3_vmware.1
docker tag  registry.tkg.vmware.run/antrea/antrea-debian:v0.9.3_vmware.1 cormac-tkgm.corinternal.com/library/antrea/antrea-debian:v0.9.3_vmware.1
docker push cormac-tkgm.corinternal.com/library/antrea/antrea-debian:v0.9.3_vmware.1
.
--<snip>
.

$ ./publish-images.sh
registry.tkg.vmware.run/velero/velero-plugin-for-vsphere:v1.0.2_vmware.1
The push refers to repository [cormac-tkgm.corinternal.com/library/velero/velero-plugin-for-vsphere]
fc075a5f6276: Layer already exists
9cd4316ae370: Layer already exists
5fec6d6d7c8c: Layer already exists
e7932ff84389: Layer already exists
40af5eccbc98: Layer already exists
895dac616c95: Layer already exists
e53100abc225: Layer already exists
767a7b7a8ec5: Layer already exists
v1.0.2_vmware.1: digest: sha256:68a0334bf06747b87650c618c713bff7c28836183cbafa13bb81c18c250a272a size: 1991
v1.4.2_vmware.1: Pulling from velero/velero-restic-restore-helper
Digest: sha256:8e0756ecfc07e0e4812daec3dce44b6ccef5fc64aa0f438e42a6592b2cf2a634
Status: Image is up to date for registry.tkg.vmware.run/velero/velero-restic-restore-helper:v1.4.2_vmware.1
.
--<snip>
.

When the script completes, our internal Harbor image registry should contain all of the images necessary to deploy TKG. The connection to the external internet can now be removed from this virtual machine, and you should be able to run the rest of these commands in your air-gapped, internet-restricted environment.
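
As an optional sanity check before disconnecting, you could pull one of the pushed images back from the local Harbor registry (image name taken from the publish script output above):

$ docker pull cormac-tkgm.corinternal.com/library/antrea/antrea-debian:v0.9.3_vmware.1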

Initialize the TKG config.yaml

In the .tkg folder, there exists a config.yaml. This needs to be populated with a bunch of additional information about the vSphere environment. The simplest way to do this is to launch the TKG manager UI and populate the fields accordingly. The UI is launched with the following command:

$ tkg init --ui

Logs of the command execution can also be found at: /tmp/tkg-20201130T110113689528616.log

Validating the pre-requisites...
Serving kickstart UI at http://127.0.0.1:8080

You can now launch a browser and point it to the URL for the kickstart UI above. I won’t show how to populate all of the fields – it is pretty self-explanatory. Note that you will need to have the required OVA deployed in your vSphere environment and converted to a template for the UI to pick it up. Full details are in the official TKG documentation. Here are some details taken from my environment.

Notice that at the bottom of this page, the CLI command is provided. We will not deploy the configuration from the UI; instead, we will take the tkg CLI command and run it manually.

The reason for not deploying this via the UI is to make an additional change to the .tkg/config.yaml. This may or may not be necessary, but in my testing I also added the environment variables seen earlier to the manifest.

$ head -4 .tkg/config.yaml
TKG_CUSTOM_IMAGE_REPOSITORY_SKIP_TLS_VERIFY: true
TKG_CUSTOM_IMAGE_REPOSITORY: cormac-tkgm.corinternal.com/library
cert-manager-timeout: 30m0s
overridesFolder: /home/cormac/.tkg/overrides

We can now go ahead with the deployment of the TKG management cluster.

Deploy the TKG management cluster

All of the images for the TKG management cluster should now be pulled from the local Harbor image registry. There should be no attempt to pull images from the public TKG registry, so long as the environment variables point at the correct locations. With the skip-TLS-verify environment variable set, we should also avoid any X509 certificate errors, which would otherwise be observed by the KinD bootstrap node and the TKG nodes when trying to access container images from the Harbor registry.

Note that this is TKG v1.2.0, so there are some new options on the command line, including the ability to specify a load balancer IP address (no need to deploy an HA Proxy OVA in this version) as well as the ability to pick a CNI, in this case Antrea. For more on Antrea, check out this recent blog post. Note that you are also offered the option of using vSphere with Tanzu or VCF with Tanzu. However, in this case I am going to proceed with TKG and deploy a non-integrated TKG management cluster on vSphere.

$ tkg init -i vsphere --vsphere-controlplane-endpoint-ip 10.35.13.244 -p prod --cni antrea
Logs of the command execution can also be found at: /tmp/tkg-20201130T140502473586632.log

Validating the pre-requisites...

vSphere 7.0 Environment Detected.

You have connected to a vSphere 7.0 environment which does not have vSphere with Tanzu enabled. vSphere with Tanzu includes
an integrated Tanzu Kubernetes Grid Service which turns a vSphere cluster into a platform for running Kubernetes workloads in dedicated
resource pools. Configuring Tanzu Kubernetes Grid Service is done through vSphere HTML5 client.

Tanzu Kubernetes Grid Service is the preferred way to consume Tanzu Kubernetes Grid in vSphere 7.0 environments. Alternatively you may
deploy a non-integrated Tanzu Kubernetes Grid instance on vSphere 7.0.
Do you want to configure vSphere with Tanzu? [y/N]: N
Would you like to deploy a non-integrated Tanzu Kubernetes Grid management cluster on vSphere 7.0? [y/N]: y
Deploying TKG management cluster on vSphere 7.0 ...

Setting up management cluster...
Validating configuration...
Using infrastructure provider vsphere:v0.7.1
Generating cluster configuration...
Setting up bootstrapper...
Bootstrapper created. Kubeconfig: /home/cormac/.kube-tkg/tmp/config_GsUp1GFQ
Installing providers on bootstrapper...
Fetching providers
Installing cert-manager Version="v0.16.1"
Waiting for cert-manager to be available...
Installing Provider="cluster-api" Version="v0.3.10" TargetNamespace="capi-system"
Installing Provider="bootstrap-kubeadm" Version="v0.3.10" TargetNamespace="capi-kubeadm-bootstrap-system"
Installing Provider="control-plane-kubeadm" Version="v0.3.10" TargetNamespace="capi-kubeadm-control-plane-system"
Installing Provider="infrastructure-vsphere" Version="v0.7.1" TargetNamespace="capv-system"
Start creating management cluster...
Saving management cluster kubeconfig into /home/cormac/.kube/config
Installing providers on management cluster...
Fetching providers
Installing cert-manager Version="v0.16.1"
Waiting for cert-manager to be available...
Installing Provider="cluster-api" Version="v0.3.10" TargetNamespace="capi-system"
Installing Provider="bootstrap-kubeadm" Version="v0.3.10" TargetNamespace="capi-kubeadm-bootstrap-system"
Installing Provider="control-plane-kubeadm" Version="v0.3.10" TargetNamespace="capi-kubeadm-control-plane-system"
Installing Provider="infrastructure-vsphere" Version="v0.7.1" TargetNamespace="capv-system"
Waiting for the management cluster to get ready for move...
Waiting for addons installation...
Moving all Cluster API objects from bootstrap cluster to management cluster...
Performing move...
Discovering Cluster API objects
Moving Cluster API objects Clusters=1
Creating objects in the target cluster
Deleting objects from the source cluster
Context set for management cluster tkg-mgmt-vsphere-20201130140507 as 'tkg-mgmt-vsphere-20201130140507-admin@tkg-mgmt-vsphere-20201130140507'.

Management cluster created!

You can now create your first workload cluster by running the following:

  tkg create cluster [name] --kubernetes-version=[version] --plan=[plan]
$

$ tkg get management-cluster
MANAGEMENT-CLUSTER-NAME            CONTEXT-NAME                                                           STATUS
tkg-mgmt-vsphere-20201130140507 *  tkg-mgmt-vsphere-20201130140507-admin@tkg-mgmt-vsphere-20201130140507  Success

Excellent! We see the KinD bootstrap cluster deployed initially as a container in docker, which is then used to create the TKG management cluster as a set of VMs. Once everything is up and running, the context switches from the KinD cluster to the TKG management cluster, and the KinD cluster is removed. All of this has been achieved using images from our Harbor image registry – there was no need to pull any images from external repositories. For more details about what is happening during this initialization process, check out this earlier post that I wrote on KinD and TKG.

Deploy a TKG workload cluster

Now that the management cluster is up and running, we can do our final test. Let’s deploy a workload cluster using the tkg command line tool. As before, we need to provide a Load Balancer IP address for the control plane. I’ve also requested that there be 3 nodes in the control plane, and 5 worker nodes.

$ tkg create cluster my-cluster --plan=prod --controlplane-machine-count=3 --worker-machine-count=5 \
--vsphere-controlplane-endpoint-ip 10.35.13.246
Logs of the command execution can also be found at: /tmp/tkg-20201130T142739702360132.log
Validating configuration...
Creating workload cluster 'my-cluster'...
Waiting for cluster to be initialized...
Waiting for cluster nodes to be available...
Waiting for addons installation...

Workload cluster 'my-cluster' created

$ tkg get cluster --include-management-cluster
NAME                             NAMESPACE   STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES
my-cluster                       default     running  3/3           5/5      v1.19.1+vmware.2  <none>
tkg-mgmt-vsphere-20201130140507  tkg-system  running  3/3           1/1      v1.19.1+vmware.2  management

Both the TKG management cluster VMs and workload cluster VMs are all visible in the vSphere client.

TKG is running successfully in an air-gapped, internet restricted environment.

Troubleshooting

I spent quite a bit of time in getting this functionality to work. I thought it might be useful to share some of that experience with you.

Gotchas

Let’s start with the gotchas. The ability to use your own self-signed certificates with the environment variable TKG_CUSTOM_IMAGE_REPOSITORY_SKIP_TLS_VERIFY is only available in TKG v1.2.0. I spent a lot of time with TKG v1.1.3 and constantly hit X509 certificate issues when the KinD node was trying to pull images (such as the cert-manager containers) from the Harbor registry. It was only after moving to v1.2.0 that I was able to get this to work successfully with the environment variable. Again, this is OK for non-production setups, but for production you should really use a certificate signed by a trusted CA, or, if you do want to use your own trusted root CA, inject it into the TKG nodes on every cluster.

tkg init verbosity

One of the reasons why I ran tkg init at the command line rather than via the UI is that you can add verbosity to the output. If you run the command with -v 5 or -v 9, you get a lot more information about the steps taking place during the deployment, which can be very useful for troubleshooting.
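
For example, the same deployment shown earlier can be re-run with maximum verbosity:

$ tkg init -i vsphere --vsphere-controlplane-endpoint-ip 10.35.13.244 -p prod --cni antrea -v 9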

kubectl on Kind components

If you are interested in querying the state of the bootstrap KinD cluster, the Kubernetes configuration file can be found in .kube-tkg/tmp. You can use the kubectl command installed earlier to query the status of various objects. Here is an example displaying the Pods running in the KinD cluster. This is extremely useful for checking whether the Pods are able to pull their required images successfully from the Harbor repository.

$ kubectl get pods -A --kubeconfig .kube-tkg/tmp/config_2sivlONl
NAMESPACE                           NAME                                                                  READY   STATUS    RESTARTS   AGE
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-748cff6cd9-rsxkh            2/2     Running   0          14s
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-5fb647c458-zdkd2        2/2     Running   0          12s
capi-system                         capi-controller-manager-686c54469c-9rp86                              2/2     Running   0          15s
capi-webhook-system                 capi-controller-manager-5d66994b4b-zpqqb                              1/2     Running   0          16s
capi-webhook-system                 capi-kubeadm-bootstrap-controller-manager-86b5cbdc78-9xms4            2/2     Running   0          15s
capi-webhook-system                 capi-kubeadm-control-plane-controller-manager-d9c45cbc-qr5zb          2/2     Running   0          13s
capi-webhook-system                 capv-controller-manager-77c9948bb7-cpm48                              1/2     Running   0          11s
capv-system                         capv-controller-manager-6bcd99dfd-ddxmr                               1/2     Running   0          10s
cert-manager                        cert-manager-6bd4f58b67-tx98d                                         1/1     Running   0          34s
cert-manager                        cert-manager-cainjector-85dd796c84-lwqsj                              1/1     Running   0          34s
cert-manager                        cert-manager-webhook-5fffc4d84c-x4kbj                                 1/1     Running   0          34s
kube-system                         coredns-5bcf65484d-9cpp8                                              1/1     Running   0          46s
kube-system                         coredns-5bcf65484d-rlcxx                                              1/1     Running   0          46s
kube-system                         etcd-tkg-kind-bv2f4a0cnnikkm7aehh0-control-plane                      0/1     Running   0          56s
kube-system                         kindnet-j29wn                                                         1/1     Running   0          46s
kube-system                         kube-apiserver-tkg-kind-bv2f4a0cnnikkm7aehh0-control-plane            1/1     Running   0          56s
kube-system                         kube-controller-manager-tkg-kind-bv2f4a0cnnikkm7aehh0-control-plane   0/1     Running   0          56s
kube-system                         kube-proxy-cg958                                                      1/1     Running   0          46s
kube-system                         kube-scheduler-tkg-kind-bv2f4a0cnnikkm7aehh0-control-plane            0/1     Running   0          56s
local-path-storage                  local-path-provisioner-8b46957d4-rgcjv                                1/1     Running   0          46s

If there are issues with a particular Pod, you can describe the Pod as follows to see any related events. This is from a deployment where the environment variable to skip certificate verification was not set, so I experienced X509 certificate issues, as you can see in the events below.

$ kubectl describe pod cert-manager-6bd4f58b67-zst7h -n cert-manager --kubeconfig .kube-tkg/tmp/config_JrrfmfML
Name:         cert-manager-6bd4f58b67-zst7h
Namespace:    cert-manager
Priority:     0
Node:         tkg-kind-bv2el4gcnnig5b03esv0-control-plane/172.17.0.3
Start Time:   Mon, 30 Nov 2020 12:51:19 +0000
Labels:       app=cert-manager
              app.kubernetes.io/component=controller
              app.kubernetes.io/instance=cert-manager
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=cert-manager
              helm.sh/chart=cert-manager-v0.16.1
              pod-template-hash=6bd4f58b67
Annotations:  prometheus.io/path: /metrics
              prometheus.io/port: 9402
              prometheus.io/scrape: true
Status:       Pending
IP:           10.244.0.3
IPs:
  IP:           10.244.0.3
Controlled By:  ReplicaSet/cert-manager-6bd4f58b67
.
.--<snip>
.
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  75s                default-scheduler  Successfully assigned cert-manager/cert-manager-6bd4f58b67-zst7h to tkg-kind-bv2el4gcnnig5b03esv0-control-plane
  Normal   Pulling    39s (x3 over 74s)  kubelet            Pulling image "cormac-tkgm.corinternal.com/library/cert-manager/cert-manager-controller:v0.16.1_vmware.1"
  Warning  Failed     39s (x3 over 74s)  kubelet            Failed to pull image "cormac-tkgm.corinternal.com/library/cert-manager/cert-manager-controller:v0.16.1_vmware.1": \
rpc error: code = Unknown desc = failed to pull and unpack image "cormac-tkgm.corinternal.com/library/cert-manager/cert-manager-controller:v0.16.1_vmware.1": \
failed to resolve reference "cormac-tkgm.corinternal.com/library/cert-manager/cert-manager-controller:v0.16.1_vmware.1": failed to do request: \
Head https://cormac-tkgm.corinternal.com/v2/library/cert-manager/cert-manager-controller/manifests/v0.16.1_vmware.1: x509: certificate signed by unknown authority
  Warning  Failed     39s (x3 over 74s)  kubelet            Error: ErrImagePull
  Normal   BackOff    11s (x4 over 73s)  kubelet            Back-off pulling image "cormac-tkgm.corinternal.com/library/cert-manager/cert-manager-controller:v0.16.1_vmware.1"
  Warning  Failed     11s (x4 over 73s)  kubelet            Error: ImagePullBackOff

Logging onto the Kind nodes

For advanced troubleshooting, you may also need to log onto the KinD node (docker container) to do further investigation. This is achieved by using docker exec -it <ID> bash, where ID is the ID of the KinD node/container. Once logged onto the KinD node/container, you can use ctr (the containerd CLI) to manage container images. You should therefore be able to attempt a ctr image pull from within the KinD node against the Harbor repository to investigate any issues with image pulls; an example follows the ctr output below.
$ docker ps -a
CONTAINER ID        IMAGE                                                            COMMAND                  CREATED             STATUS                 PORTS                                         NAMES
cf853a6447e0        cormac-tkgm.corinternal.com/library/kind/node:v1.19.1_vmware.2   "/usr/local/bin/entr…"   4 minutes ago       Up 4 minutes           127.0.0.1:34695->6443/tcp                     tkg-kind-bv2hfq0cnnimqrerr300-control-plane
b1e3c3034c1c        goharbor/nginx-photon:v2.1.0                                     "nginx -g 'daemon of…"   4 hours ago         Up 4 hours (healthy)   0.0.0.0:80->8080/tcp, 0.0.0.0:443->8443/tcp   nginx
6c044e4844a6        goharbor/harbor-jobservice:v2.1.0                                "/harbor/entrypoint.…"   4 hours ago         Up 4 hours (healthy)                                                 harbor-jobservice
047889a4e74e        goharbor/harbor-core:v2.1.0                                      "/harbor/entrypoint.…"   4 hours ago         Up 4 hours (healthy)                                                 harbor-core
1f14a1e6aecc        goharbor/harbor-registryctl:v2.1.0                               "/home/harbor/start.…"   4 hours ago         Up 4 hours (healthy)                                                 registryctl
bffe8ac6bf75        goharbor/registry-photon:v2.1.0                                  "/home/harbor/entryp…"   4 hours ago         Up 4 hours (healthy)                                                 registry
6afad7a5504c        goharbor/harbor-db:v2.1.0                                        "/docker-entrypoint.…"   4 hours ago         Up 4 hours (healthy)                                                 harbor-db
5ab3f2045a34        goharbor/harbor-portal:v2.1.0                                    "nginx -g 'daemon of…"   4 hours ago         Up 4 hours (healthy)                                                 harbor-portal
02b2384abb73        goharbor/redis-photon:v2.1.0                                     "redis-server /etc/r…"   4 hours ago         Up 4 hours (healthy)                                                 redis
f37c1d9ac1ad        goharbor/harbor-log:v2.1.0                                       "/bin/sh -c /usr/loc…"   4 hours ago         Up 4 hours (healthy)   127.0.0.1:1514->10514/tcp                     harbor-log

$ docker exec -it cf853a6447e0 bash
root@tkg-kind-bv2hfq0cnnimqrerr300-control-plane:/# ctr

NAME:
   ctr -
        __
  _____/ /______
/ ___/ __/ ___/
/ /__/ /_/ /
\___/\__/_/

containerd CLI

USAGE:
   ctr [global options] command [command options] [arguments...]

VERSION:
   v1.3.3-14-g449e9269

DESCRIPTION
ctr is an unsupported debug and administrative client for interacting
with the containerd daemon. Because it is unsupported, the commands,
options, and operations are not guaranteed to be backward compatible or
stable from release to release of the containerd project.
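
For example, to check that the KinD node can reach Harbor and pull an image, you could attempt something like the following from inside the node. The image name is taken from the failed pull events shown earlier; the -k (--skip-verify) flag tells ctr to skip TLS certificate validation against the self-signed Harbor certificate.

root@tkg-kind-bv2hfq0cnnimqrerr300-control-plane:/# ctr image pull -k cormac-tkgm.corinternal.com/library/cert-manager/cert-manager-controller:v0.16.1_vmware.1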

Kudos

I’d like to close with a big thank you to my colleagues, Tom Schwaller and Keith Lee, for their guidance on this, especially on the certificate gotchas highlighted above. I’d also like to thank Chip Zoller for providing more details on the secure certificate methods. Thanks guys!