1. Time Sync / NTP
My first recommendation is to ensure that time is synchronized between the jumpbox and the ESXi hosts where TKG will be deployed. I encountered a very strange error when the time on my jumpbox was out of sync with the TKG control plane VM. The TKG VM was deployed to an ESXi host where NTP was not enabled, and its clock was out by approximately 4-5 minutes. The TKG cluster failed to deploy, and when I SSH'ed to the control plane node, I observed the following errors in /var/log/cloud-init-output.log:
[2021-11-11 10:04:05] W1111 10:04:05.332126 1049 certs.go:489] WARNING: could not validate bounds for certificate CA: \
the certificate is not valid yet: NotBefore: 2021-11-11 10:11:28 +0000 UTC, NotAfter: 2031-11-09 10:16:28 +0000 UTC
[2021-11-11 10:04:05] W1111 10:04:05.332237 1049 certs.go:489] WARNING: could not validate bounds for certificate front-proxy CA: \
the certificate is not valid yet: NotBefore: 2021-11-11 10:11:28 +0000 UTC, NotAfter: 2031-11-09 10:16:28 +0000 UTC
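The NotBefore timestamps in those warnings are the giveaway: the node's clock was sitting before the start of the certificate's validity window. A quick way to compare a certificate's validity window against the current clock is with openssl. The sketch below uses a throwaway self-signed cert so it is self-contained; on a real control plane node you would point openssl at /etc/kubernetes/pki/ca.crt instead:

```shell
# Generate a throwaway self-signed cert purely for demonstration
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=demo-ca" \
  -keyout /tmp/demo-ca.key -out /tmp/demo-ca.crt 2>/dev/null

# Print the validity window, then the node's current UTC time for comparison.
# If "date -u" is earlier than notBefore, the clock is the problem.
openssl x509 -in /tmp/demo-ca.crt -noout -startdate -enddate
date -u
```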
I then listed all Kubernetes containers running in containerd on the TKG management control plane node using crictl:
# crictl --runtime-endpoint /var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause
W1111 10:19:51.112026    5173 util_unix.go:103] Using "/var/run/containerd/containerd.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/containerd/containerd.sock".
W1111 10:19:51.112556    5173 util_unix.go:103] Using "/var/run/containerd/containerd.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/containerd/containerd.sock".
0f082c283175c   32e0a5740f67c   7 minutes ago    Running   kube-apiserver            6   bfe10740ce5cc
6c2ba3fbfa4e0   32e0a5740f67c   10 minutes ago   Exited    kube-apiserver            5   bfe10740ce5cc
be931b8ceddd2   05d7f1f146f50   15 minutes ago   Running   kube-vip                  0   899bb94f2974f
59d7c4f6f605f   510bf85cc5636   15 minutes ago   Running   kube-scheduler            0   209a3a6cbb2ba
e8cb0f442eeed   05256fc9ff2a7   15 minutes ago   Running   kube-controller-manager   0   e215b867c78c0
I then used the following command to display the logs of the kube-apiserver container on the management cluster node, since that was the one having problems initializing:
# crictl --runtime-endpoint /var/run/containerd/containerd.sock logs 6c2ba3fbfa4e0
.
.
I1111 10:09:24.774559       1 client.go:360] parsed scheme: "endpoint"
I1111 10:09:24.774591       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379  <nil> 0 <nil>}]
W1111 10:09:24.777943       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. \
Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: \
current time 2021-11-11T10:09:24Z is before 2021-11-11T10:11:29Z". Reconnecting...
I1111 10:09:25.771192       1 client.go:360] parsed scheme: "endpoint"
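Helpfully, that handshake error spells out the skew: the node clock (2021-11-11T10:09:24Z) is just over two minutes behind the certificate's NotBefore time (2021-11-11T10:11:29Z). With GNU date on the jumpbox (an assumption; BSD date uses different flags) you can turn the two timestamps into an exact figure:

```shell
# Timestamps taken from the kube-apiserver log above
node_time="2021-11-11T10:09:24Z"
not_before="2021-11-11T10:11:29Z"

# Convert both to epoch seconds and subtract (GNU date syntax)
skew=$(( $(date -d "$not_before" +%s) - $(date -d "$node_time" +%s) ))
echo "node clock is ${skew}s behind the certificate NotBefore"
```

In this case the node was 125 seconds behind, comfortably outside what the TLS handshake will tolerate.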
Once I configured all of the ESXi hosts to use NTP, and ensured that the time on the jumpbox matched the time on the ESXi hosts (and thus on the TKG control plane VMs deployed to those hosts), the TKG management cluster deployed successfully.
2. x509: certificate signed by unknown authority
In the TKG v1.3 air-gapped guide, there is a section which discusses setting the TKG_CUSTOM_IMAGE_REPOSITORY_CA_CERTIFICATE environment variable when your private Docker registry has been set up with a self-signed certificate. The publish-images script, which pushes the images to the local image registry, is supposed to use this environment variable to establish trust with the image registry. Unfortunately, the script does not pick up this environment variable at present. Instead, you will need to pass the --registry-ca-cert-path option to the imgpkg command for trust to be established with the internal Harbor image registry when pushing images. Here is an example of what you might observe without this setting in place, where harbor-for-tkgv14.corinternal.com is my local Harbor image registry:
$ imgpkg copy --tar tkg-bom-v1.4.0.tar --to-repo harbor-for-tkgv14.corinternal.com/library/tkg-bom
copy | importing 1 images...
0 B / ? [-----------------------------------------------------------------------------------------------------------------------=] 0.00% 2562047h47m16s
copy | done uploading images
Error: Retried 5 times: Get "https://harbor-for-tkgv14.corinternal.com/v2/": x509: certificate signed by unknown authority
It was only when I specified the CA cert from the image registry with the imgpkg command that the copy succeeded.
$ imgpkg copy --tar tkg-bom-v1.4.0.tar --to-repo harbor-for-tkgv14.corinternal.com/library/tkg-bom --registry-ca-cert-path ./harbor-ca.crt
copy | importing 1 images...
697 B / 2.86 KiB [==========================>-------------------------------------------------------------------------------------] 23.79% 6.84 KiB/s 0s
copy | done uploading images
Succeeded
Note that the instructions for TKG v1.4 do not include any reference to the environment variable. However, if you are using a self-signed certificate with the local / internal Harbor registry, you will certainly need to include the --registry-ca-cert-path option in the gen-images script to establish trust and avoid the x509 error shown earlier. This issue has already been reported, and we should see it resolved in a future release.
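One way to apply this across the script is to rewrite every imgpkg copy invocation with sed. This is only a sketch: the script name, its contents, and the CA cert path below are stand-ins, and you should check the real script before editing it in place.

```shell
# Path to the CA cert of the internal Harbor registry (assumption: already saved locally)
ca_cert="./harbor-ca.crt"

# Create a tiny stand-in for the real gen-images script, just to demonstrate the edit
cat > /tmp/gen-images-demo.sh <<'EOF'
imgpkg copy --tar tkg-bom-v1.4.0.tar --to-repo harbor-for-tkgv14.corinternal.com/library/tkg-bom
EOF

# Append --registry-ca-cert-path to every imgpkg copy invocation
sed -i "s|imgpkg copy|imgpkg copy --registry-ca-cert-path ${ca_cert}|" /tmp/gen-images-demo.sh
cat /tmp/gen-images-demo.sh
```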
3. could not fetch manifest from repository “core”
This one is more of an annoyance than anything, but when you begin to use the tanzu CLI on the air-gapped jumpbox, which does not have any connection to the internet, you will observe a message similar to the following:
$ tanzu plugin list
! Unable to query remote plugin repositories : could not fetch manifest from repository "core": \
Get "https://storage.googleapis.com/tanzu-cli-tkg/artifacts/manifest.yaml": context deadline exceeded
As you can see, there is an attempt to get a manifest from an external repository, something which is not possible from the jumpbox since it is air-gapped. It's not such a big deal, but it does mean that there is a short delay in the tanzu command responding while it realizes that it cannot fetch the manifest. It is easily resolved, though, by simply removing the core repository.
$ tanzu config show
Command "show" is deprecated, will be removed in version "1.5.0". Use "get" instead
apiVersion: config.tanzu.vmware.com/v1alpha1
clientOptions:
  cli:
    repositories:
    - gcpPluginRepository:
        bucketName: tanzu-cli-tkg
        name: core
    unstableVersionSelector: none
kind: ClientConfig
metadata:
  creationTimestamp: null

$ tanzu plugin repo delete core

$ tanzu config show
Command "show" is deprecated, will be removed in version "1.5.0". Use "get" instead
apiVersion: config.tanzu.vmware.com/v1alpha1
clientOptions:
  cli:
    unstableVersionSelector: none
kind: ClientConfig
metadata:
  creationTimestamp: null

$ tanzu plugin list
  NAME                LATEST VERSION  DESCRIPTION                                                         REPOSITORY  VERSION  STATUS
  cluster                             Kubernetes cluster operations                                                   v1.4.0   installed
  kubernetes-release                  Kubernetes release operations                                                   v1.4.0   installed
  login                               Login to the platform                                                           v1.4.0   installed
  management-cluster                  Kubernetes management cluster operations                                        v1.4.0   installed
  package                             Tanzu package management                                                        v1.4.0   installed
  pinniped-auth                       Pinniped authentication operations (usually not directly invoked)              v1.4.0   installed
Hope you find these tips useful.