Some useful tips when deploying TKG in an air-gapped environment
Recently I have been looking at deploying Tanzu Kubernetes Grid (TKG) in air-gapped or internet-restricted environments. Interestingly, we offer different procedures for TKG v1.3 and TKG v1.4. In TKG v1.3, we pull the TKG images one at a time from the external VMware registry and immediately push them up to an internal Harbor registry. In TKG v1.4, there is a different approach whereby all the images are first downloaded (in tar format) onto a workstation that has internet access. These images are then securely copied to the TKG jumpbox workstation, and from there they are uploaded to the local Harbor registry so that they can be used for creating the TKG clusters. Although the solution is a little different in each case, many of the steps are the same. In this post, I wanted to highlight a few tips and tricks to make you successful with TKG air-gapped / internet-restricted deployments.
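To make the TKG v1.4 workflow a little more concrete, here is a minimal sketch of the two imgpkg steps involved. The image reference, tar file name and registry paths are purely illustrative; in practice the supplied scripts generate these commands for the full set of TKG images.

# On the workstation with internet access: save an image to a tar file.
# (illustrative image reference only - not an actual TKG BoM image name)
imgpkg copy -i projects.registry.vmware.com/tkg/example-image:v1.4.0 --to-tar example-image.tar

# Copy the tar to the air-gapped jumpbox (scp, rsync, removable media, etc.),
# then push it into the internal Harbor registry from there.
imgpkg copy --tar example-image.tar --to-repo harbor-for-tkgv14.corinternal.com/library/example-image

If the Harbor registry uses a self-signed certificate, you will also need the CA certificate option discussed in tip 2 below.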
1. Time Sync / NTP
My first recommendation is to ensure that time is synchronized between the jumpbox and the ESXi hosts where TKG will be deployed. I encountered a very strange error when the time on my jumpbox was out of sync with the TKG control plane VM. The TKG VM was deployed to an ESXi host where NTP was not enabled, and the time was out by approximately 4-5 minutes. The TKG cluster failed to deploy, and when I SSH'ed to the control plane node, I observed the following errors in /var/log/cloud-init-output.log:
[2021-11-11 10:04:05] W1111 10:04:05.332126    1049 certs.go:489] WARNING: could not validate bounds for certificate CA: \
the certificate is not valid yet: NotBefore: 2021-11-11 10:11:28 +0000 UTC, NotAfter: 2031-11-09 10:16:28 +0000 UTC
[2021-11-11 10:04:05] W1111 10:04:05.332237    1049 certs.go:489] WARNING: could not validate bounds for certificate front-proxy CA: \
the certificate is not valid yet: NotBefore: 2021-11-11 10:11:28 +0000 UTC, NotAfter: 2031-11-09 10:16:28 +0000 UTC
I then used crictl to list all of the Kubernetes containers running in containerd on the TKG management cluster control plane node:
# crictl --runtime-endpoint /var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause
W1111 10:19:51.112026    5173 util_unix.go:103] Using "/var/run/containerd/containerd.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/containerd/containerd.sock".
W1111 10:19:51.112556    5173 util_unix.go:103] Using "/var/run/containerd/containerd.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/containerd/containerd.sock".
0f082c283175c   32e0a5740f67c   7 minutes ago    Running   kube-apiserver            6   bfe10740ce5cc
6c2ba3fbfa4e0   32e0a5740f67c   10 minutes ago   Exited    kube-apiserver            5   bfe10740ce5cc
be931b8ceddd2   05d7f1f146f50   15 minutes ago   Running   kube-vip                  0   899bb94f2974f
59d7c4f6f605f   510bf85cc5636   15 minutes ago   Running   kube-scheduler            0   209a3a6cbb2ba
e8cb0f442eeed   05256fc9ff2a7   15 minutes ago   Running   kube-controller-manager   0   e215b867c78c0
I then used the following command to display the logs of the exited kube-apiserver container on the management cluster node, since it appeared to be having problems initializing:
# crictl --runtime-endpoint /var/run/containerd/containerd.sock logs 6c2ba3fbfa4e0
.
.
I1111 10:09:24.774559       1 client.go:360] parsed scheme: "endpoint"
I1111 10:09:24.774591       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379  <nil> 0 <nil>}]
W1111 10:09:24.777943       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. \
Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: \
current time 2021-11-11T10:09:24Z is before 2021-11-11T10:11:29Z". Reconnecting...
I1111 10:09:25.771192       1 client.go:360] parsed scheme: "endpoint"
In other words, the kube-apiserver could not complete the TLS handshake with etcd because, according to the node's out-of-sync clock, the bootstrap certificates were not yet valid. Once I configured all of the ESXi hosts to use NTP, and ensured that the time on the jumpbox matched the time on the ESXi hosts (and therefore on the TKG control plane VMs deployed to those hosts), the TKG management cluster deployed successfully.
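As a quick sanity check before (re)deploying, you can compare the clocks directly. This is just a sketch; it assumes SSH is enabled on the ESXi hosts and that the jumpbox is a systemd-based Linux machine:

# On the jumpbox: show the current UTC time and the NTP synchronization status.
date -u
timedatectl status

# On each ESXi host, over SSH (the esxcli system time namespace is present
# on recent ESXi releases):
esxcli system time get

# On an already-deployed control plane node, over SSH as the capv user:
date -u

If the timestamps differ by more than a few seconds, fix NTP on the hosts before attempting the deployment again.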
2. x509: certificate signed by unknown authority
In the TKG v1.3 air-gapped guide, there is a section which discusses setting the TKG_CUSTOM_IMAGE_REPOSITORY_CA_CERTIFICATE environment variable when your private Docker registry has been set up with a self-signed certificate. The publish-images script, which pushes the images to the local image registry, is supposed to use this environment variable to establish trust with the image registry. Unfortunately, the script does not pick up this environment variable at present. Instead, you will need to add the --registry-ca-cert-path option to the imgpkg command so that trust can be established with the registry when pushing images to the internal Harbor image registry. Here is an example of what you might observe without this setting in place, where harbor-for-tkgv14.corinternal.com is my local Harbor image registry:
$ imgpkg copy --tar tkg-bom-v1.4.0.tar --to-repo harbor-for-tkgv14.corinternal.com/library/tkg-bom
copy | importing 1 images...
 0 B / ? [-----------------------------------------------------------------------------------------------------------------------=] 0.00% 2562047h47m16s
copy | done uploading images
Error: Retried 5 times: Get "https://harbor-for-tkgv14.corinternal.com/v2/": x509: certificate signed by unknown authority
It was only when I specified the CA certificate of the image registry with the imgpkg command that the copy succeeded.
$ imgpkg copy --tar tkg-bom-v1.4.0.tar --to-repo harbor-for-tkgv14.corinternal.com/library/tkg-bom --registry-ca-cert-path ./harbor-ca.crt
copy | importing 1 images...
 697 B / 2.86 KiB [==========================>-------------------------------------------------------------------------------------] 23.79% 6.84 KiB/s 0s
copy | done uploading images
Succeeded
Note that the instructions for TKG v1.4 do not include any reference to the environment variable. However, if you are using a self-signed certificate with the local/internal Harbor registry, you will certainly need to include the --registry-ca-cert-path option in the gen-images script to establish trust and avoid the x509 error shown earlier. This issue has already been reported, and we should see it resolved in a future release.
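For reference, the change simply involves appending the flag to each generated imgpkg copy command. The snippet below is a hypothetical sketch of what that might look like; the variable names, directory layout and loop are my own and do not reflect the exact contents of the TKG scripts.

# Hypothetical sketch: push every downloaded image tar to the internal registry,
# passing the Harbor CA certificate so that imgpkg trusts the self-signed endpoint.
CA_CERT=./harbor-ca.crt
REPO_PREFIX=harbor-for-tkgv14.corinternal.com/library

for tarball in ./images/*.tar; do
  name=$(basename "${tarball}" .tar)
  imgpkg copy --tar "${tarball}" \
    --to-repo "${REPO_PREFIX}/${name}" \
    --registry-ca-cert-path "${CA_CERT}"
done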
3. could not fetch manifest from repository “core”
This one is more of an annoyance than anything, but when you begin to use the tanzu CLI on the air-gapped jumpbox which does not have any connection to the internet, you will observe a message similar to the following:
$ tanzu plugin list
! Unable to query remote plugin repositories : could not fetch manifest from repository "core": \
Get "https://storage.googleapis.com/tanzu-cli-tkg/artifacts/manifest.yaml": context deadline exceeded
As you can see, there is an attempt to fetch a manifest from an external repository, something which is not possible from the jumpbox since it is air-gapped. It's not a big deal, but it does mean there is a short delay before the tanzu command responds while it times out trying to fetch that manifest. It is easily resolved, though, by simply removing the core plugin repository.
$ tanzu config show
Command "show" is deprecated, will be removed in version "1.5.0". Use "get" instead
apiVersion: config.tanzu.vmware.com/v1alpha1
clientOptions:
  cli:
    repositories:
    - gcpPluginRepository:
        bucketName: tanzu-cli-tkg
        name: core
    unstableVersionSelector: none
kind: ClientConfig
metadata:
  creationTimestamp: null

$ tanzu plugin repo delete core

$ tanzu config show
Command "show" is deprecated, will be removed in version "1.5.0". Use "get" instead
apiVersion: config.tanzu.vmware.com/v1alpha1
clientOptions:
  cli:
    unstableVersionSelector: none
kind: ClientConfig
metadata:
  creationTimestamp: null

$ tanzu plugin list
  NAME                LATEST VERSION  DESCRIPTION                                                         REPOSITORY  VERSION  STATUS
  cluster                             Kubernetes cluster operations                                                   v1.4.0   installed
  kubernetes-release                  Kubernetes release operations                                                   v1.4.0   installed
  login                               Login to the platform                                                           v1.4.0   installed
  management-cluster                  Kubernetes management cluster operations                                        v1.4.0   installed
  package                             Tanzu package management                                                        v1.4.0   installed
  pinniped-auth                       Pinniped authentication operations (usually not directly invoked)              v1.4.0   installed
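If you want to verify the change outside of the CLI, the repository entry lives in the tanzu client configuration file. The path below is the default location on Linux; treat it as an assumption if your installation differs.

# Default tanzu client config location on Linux (assumption: standard install).
cat ~/.config/tanzu/config.yaml

# After deleting the core repository, no gcpPluginRepository entries should remain.
grep gcpPluginRepository ~/.config/tanzu/config.yaml || echo "no plugin repositories configured"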
Hope you find these tips useful.