Thus, what I am about to show you is **unsupported** today. The reason I am documenting it here is that I know a lot of customers and partners are interested in this process simply for a proof of concept. But please note that this integration should not be done in a production environment. You will not be supported. We are already working on a way to introduce this integration as a simple user experience in a future release. If you wish to implement this procedure, you do so at your own risk.
Without this procedure, any attempt to pull an image from the Harbor Image Registry will fail with the following Pod events:
Normal   Pulling   <invalid> (x4 over <invalid>)  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-s2whg  Pulling image "20.0.0.2/demo-ns/cassandra:v11"
Warning  Failed    <invalid> (x4 over <invalid>)  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-s2whg  Failed to pull image "20.0.0.2/demo-ns/cassandra:v11": rpc error: code = Unknown desc = Error response from daemon: Get https://20.0.0.2/v2/: x509: certificate signed by unknown authority
Warning  Failed    <invalid> (x4 over <invalid>)  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-s2whg  Error: ErrImagePull
Normal   BackOff   <invalid> (x6 over <invalid>)  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-s2whg  Back-off pulling image "20.0.0.2/demo-ns/cassandra:v11"
Warning  Failed    <invalid> (x7 over <invalid>)  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-s2whg  Error: ImagePullBackOff
The x509: certificate signed by unknown authority error means that the requester (the TKG cluster worker node) does not trust the certificate presented by the registry, because the CA that signed it is not in the node's trust store.
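You can reproduce this failure, and the fix, in miniature with openssl on any machine. This is just an illustrative sketch using a throwaway certificate, not the actual Harbor CA:

```shell
# Create a throwaway self-signed certificate to stand in for the Harbor CA
# (hypothetical - the real certificate is downloaded from Harbor in Step 1).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/demo-ca.key -out /tmp/demo-ca.crt -subj "/CN=harbor-demo-ca"

# Verification against the system trust store fails - the same
# "unknown authority" condition docker hits when pulling the image:
openssl verify /tmp/demo-ca.crt || true

# Once the CA is supplied explicitly, verification succeeds. Appending the
# Harbor certificate to the trust bundle on each TKG node (Step 2f)
# achieves the same thing for docker:
openssl verify -CAfile /tmp/demo-ca.crt /tmp/demo-ca.crt
```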
We can break the integration process into 4 steps.
- Retrieve the Harbor Image Registry certificate from the Harbor UI
- Push the certificate to the TKG cluster nodes
- Create a Kubernetes secret which holds the Harbor Image Registry credentials
- Include an imagePullSecrets entry in any Pod manifest that pulls an image from the Image Registry
Step 1 – Get Certificate from the Harbor Image Registry
Since Harbor is deployed via vSphere with Kubernetes, it is automatically added to the SSO domain. Simply log in to Harbor with your SSO credentials (e.g. administrator@vsphere.local), select the namespace project where the TKG cluster is deployed, and then select Repositories. Repositories are where the container images are stored. Here there is a link to download the Registry Certificate. Click on the link and save the certificate.
Step 2 – Push the registry certificate to the TKG cluster nodes
To begin, you need to be logged in to vSphere with Kubernetes at the namespace layer where the TKG cluster resides. Later we will change contexts and work at the TKG cluster layer.
There are a number of sub-steps to this step. These sub-steps can be summarized as follows:
- Fetch the secret to SSH into the TKG nodes
- Fetch the kubeconfig file for the TKG cluster
- Change contexts to the TKG cluster
- Get the IP addresses from the TKG nodes
- Copy the Image registry certificate to each node
- Install the Image registry certificate to the node’s trust bundle
- Restart docker on each of the nodes
Let’s now look at those steps in detail.
Step 2a – Fetch the SSH private key secret to SSH onto the TKG nodes
Once logged into the namespace where the TKG cluster is deployed (not logged into the TKG cluster itself), you must fetch the SSH secret for the TKG cluster that will enable login to the TKG nodes. In my example, the namespace is called demo-ns and the TKG cluster is called ch-tkg-cluster01. The SSH private key is stored in a secret that follows the naming convention <cluster>-ssh; thus, in my case, the SSH key secret is called ch-tkg-cluster01-ssh. The command to retrieve the SSH private key is as follows:
$ kubectl get secret -n demo-ns ch-tkg-cluster01-ssh \
-o jsonpath='{.data.ssh-privatekey}' | base64 -d
To make things easier later, store this private key in a file, e.g.
$ kubectl get secret -n demo-ns ch-tkg-cluster01-ssh \
-o jsonpath='{.data.ssh-privatekey}' | base64 -d > cluster-ssh
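One thing to watch out for: ssh and scp refuse to use a private key file whose permissions are too open, so it is worth locking the file down before using it in the later steps. A minimal illustration (the touch line simply stands in for the file created by the kubectl command above):

```shell
# ssh rejects keys with open permissions ("UNPROTECTED PRIVATE KEY FILE"),
# so make the key file readable by the owner only.
touch cluster-ssh        # stand-in for the file created above
chmod 600 cluster-ssh
ls -l cluster-ssh        # should now show -rw-------
```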
Step 2b – Fetch the kubeconfig file for the TKG cluster
To allow us to work at the TKG cluster level rather than the namespace level later on, get the kubeconfig for the cluster. Similar to the SSH key previously, the kubeconfig is in a secret called <cluster>-kubeconfig, so in my deployment it is called ch-tkg-cluster01-kubeconfig. The command to retrieve the kubeconfig is as follows:
$ kubectl get secret -n demo-ns ch-tkg-cluster01-kubeconfig \
-o jsonpath='{.data.value}' | base64 -d > cluster-kubeconfig
Step 2c – Switch to the TKG cluster
With the kubeconfig retrieved in the previous step, we can now switch from the namespace context to the TKG guest cluster context.
$ export KUBECONFIG=cluster-kubeconfig
You can verify that the context has changed by running kubectl get nodes. You should now see the control plane and worker VMs of the TKG cluster.
$ kubectl get nodes
NAME                                              STATUS   ROLES    AGE     VERSION
ch-tkg-cluster01-control-plane-gc8b2              Ready    master   6d19h   v1.16.8+vmware.1
ch-tkg-cluster01-workers-2krnb-65fdb7455b-7v8wd   Ready    <none>   6d19h   v1.16.8+vmware.1
ch-tkg-cluster01-workers-2krnb-65fdb7455b-9rdkb   Ready    <none>   6d19h   v1.16.8+vmware.1
ch-tkg-cluster01-workers-2krnb-65fdb7455b-s2whg   Ready    <none>   6d19h   v1.16.8+vmware.1
Step 2d – Get the IP address of the TKG nodes
I used the following script to pick up the IP address of each of the TKG cluster nodes and store them in a file called ip-list. There are multiple ways of achieving this; this is just one way.
$ for i in `kubectl get nodes --no-headers | awk '{print $1}'`
do
kubectl get node $i -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}' >> ip-list
echo >> ip-list
done

$ cat ip-list
10.244.0.242
10.244.0.244
10.244.0.245
10.244.0.243
Step 2e – Copy the Image registry certificate to each node
In this step, we need to copy the registry certificate over to each of the TKG nodes. We have the SSH private key in a file called cluster-ssh. I have also stored the registry certificate (downloaded in step 1) in a file called ca.crt in my current working directory. Thus, I can use the following command to copy the cert to each of the TKG nodes:
$ scp -i cluster-ssh ca.crt vmware-system-user@10.244.0.242:/home/vmware-system-user/registry_ca.crt
I could do this manually for each node, or I could wrap it in a script as follows (since I have the list of node IP addresses stored in a file called ip-list from the previous step):
$ for i in `cat ip-list`
do
scp -i cluster-ssh ca.crt vmware-system-user@${i}:/home/vmware-system-user/registry_ca.crt
done
Now that we have copied the registry certificate to each TKG node, as a last step we must add it to the trust bundle on each node, and then restart the docker service.
Step 2f – Add the registry certificate to the node’s trust bundle
The registry certificate is now on the TKG node, but it is not yet in a location where it will be trusted. We can use the following command to append it to the node's trust bundle.
$ ssh -i cluster-ssh vmware-system-user@10.244.0.242 \
'sudo bash -c "cat /home/vmware-system-user/registry_ca.crt >> /etc/pki/tls/certs/ca-bundle.crt"'
The authenticity of host '10.244.0.242 (10.244.0.242)' can't be established.
ECDSA key fingerprint is SHA256:uMWEr+Fh+6bwBRImd1jfefTnMU7UvGSGOCZygbaBbtg.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.244.0.242' (ECDSA) to the list of known hosts.
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
Again, rather than do this manually for every node, you could wrap it in the following script.
$ for i in `cat ip-list`
do
ssh -i cluster-ssh vmware-system-user@${i} \
'sudo bash -c "cat /home/vmware-system-user/registry_ca.crt >> /etc/pki/tls/certs/ca-bundle.crt"'
done
Step 2g – Restart the docker service on each node
The final part of this step is to restart docker. This can be done as follows:
$ ssh -i cluster-ssh vmware-system-user@10.244.0.242 'sudo systemctl restart docker.service'
Welcome to Photon 3.0 (\m) - Kernel \r (\l)
And as before, we can wrap this in a script for all nodes:
$ for i in `cat ip-list`
do
ssh -i cluster-ssh vmware-system-user@${i} 'sudo systemctl restart docker.service'
done
Combining sub-steps 2e, 2f and 2g
Now, I have simplified things by breaking this work into the 3 sub-steps 2e, 2f and 2g. You could place all of them in a single script if you wish; I separated them out to make the steps easier to follow. If you wish to combine the 3 sub-steps, you could do something similar to the following:
$ for i in `cat ip-list`
do
scp -i cluster-ssh ca.crt vmware-system-user@${i}:/home/vmware-system-user/registry_ca.crt
ssh -i cluster-ssh vmware-system-user@${i} \
'sudo bash -c "cat /home/vmware-system-user/registry_ca.crt >> /etc/pki/tls/certs/ca-bundle.crt"'
ssh -i cluster-ssh vmware-system-user@${i} 'sudo systemctl restart docker.service'
done
At this point, you might think that you have done enough to allow the TKG nodes to use the Harbor Image Registry. Unfortunately not. If you attempt to deploy an application where the Pod attempts to pull an image from the Harbor Image Registry, the Pod events no longer display the X509 error seen previously, but instead display the following failure:
Normal   BackOff  <invalid> (x6 over <invalid>)  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-9rdkb  Back-off pulling image "20.0.0.2/demo-ns/cassandra:v11"
Warning  Failed   <invalid> (x6 over <invalid>)  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-9rdkb  Error: ImagePullBackOff
Normal   Pulling  <invalid> (x4 over <invalid>)  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-9rdkb  Pulling image "20.0.0.2/demo-ns/cassandra:v11"
Warning  Failed   <invalid> (x4 over <invalid>)  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-9rdkb  Failed to pull image "20.0.0.2/demo-ns/cassandra:v11": rpc error: code = Unknown desc = Error response from daemon: pull access denied for 20.0.0.2/demo-ns/cassandra, repository does not exist or may require 'docker login'
Warning  Failed   <invalid> (x4 over <invalid>)  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-9rdkb  Error: ErrImagePull
The clue is in the error “may require ‘docker login’“. We need to provide the Pods with Image Registry credentials so that they are able to do a docker login to retrieve the image. Let’s do that next.
Step 3 – Create a secret with Image Registry credentials
This step is described in detail in the Kubernetes documentation here. To begin, we need the credentials from a valid docker login to the Harbor Image Registry from a desktop/laptop. This login creates a .docker/config.json file holding credentials, which can then be used to create a secret that your TKG Pods can use to access the image registry.
Here is my ~/.docker/config.json:
$ cat ~/.docker/config.json
{
        "auths": {
                "20.0.0.2": {
                        "auth": "YWRtaW5pc3RyYXRvckB2c3BoZXJlLmxvY2FsOlZNd2FyZTEyMyE="
                }
        }
}
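For reference, the auth value in this file is nothing more than base64 of username:password, so you can construct or verify one with base64 directly. The credentials below are hypothetical, not the ones from my config.json:

```shell
# The "auth" field is base64("username:password"). Encoding and decoding
# with hypothetical credentials:
printf 'admin:Harbor12345' | base64
# YWRtaW46SGFyYm9yMTIzNDU=
printf 'YWRtaW46SGFyYm9yMTIzNDU=' | base64 -d
# admin:Harbor12345
```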
20.0.0.2 is the IP address of my Harbor Image Registry. Yours may be different. The next step is to create a secret from this file:
$ kubectl create secret generic regcred \
> --from-file=.dockerconfigjson=/home/cormac/.docker/config.json \
> --type=kubernetes.io/dockerconfigjson
secret/regcred created
Verify that the secret was successfully created:
$ kubectl get secret regcred --output=yaml
apiVersion: v1
data:
  .dockerconfigjson: ewoJImF1dGhzIjogewoJCSIyMC4wLjAuMiI6IHsKCQkJImF1dGgiOiAiWVdSdGFXNXBjM1J5WVhSdmNrQjJjM0JvWlhKbExteHZZMkZzT2xaTmQyRnlaVEV5TXlFPSIKCQl9Cgl9Cn0K
kind: Secret
metadata:
  creationTimestamp: "2020-06-23T07:37:03Z"
  name: regcred
  namespace: default
  resourceVersion: "1560917"
  selfLink: /api/v1/namespaces/default/secrets/regcred
  uid: f189d33f-ba20-41a5-9b33-6fdaf3618a5b
type: kubernetes.io/dockerconfigjson
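As a sanity check, the .dockerconfigjson field in the secret is simply the base64-encoded contents of the config.json file, so decoding it should return the original file:

```shell
# Decoding the .dockerconfigjson value from the secret returns the
# original ~/.docker/config.json contents:
echo 'ewoJImF1dGhzIjogewoJCSIyMC4wLjAuMiI6IHsKCQkJImF1dGgiOiAiWVdSdGFXNXBjM1J5WVhSdmNrQjJjM0JvWlhKbExteHZZMkZzT2xaTmQyRnlaVEV5TXlFPSIKCQl9Cgl9Cn0K' | base64 -d
```

With a live cluster you can fetch and decode the field in one step, per the Kubernetes documentation: kubectl get secret regcred -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d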
Looks good. The last step is to modify our Pod manifests to include the secret, and of course to pull the container image from the Harbor Image Registry. I’m not going to show you how to push, tag and pull images to/from the registry – there are plenty of examples of that out there, including this blog.
Step 4 – Add secret to Pod manifest
A new entry is required in the Pod manifest so that when the Pod pulls an image from an internal image registry, it also has a secret that allows it to log in to that registry. The entry is spec.imagePullSecrets, which references the secret by name. Here is a sample manifest for a simple busybox Pod which pulls its container image from my Harbor image repository and which also includes the secret.
$ cat busybox-cor.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ch-busybox
  labels:
    app: ch-busybox
spec:
  containers:
  - image: "20.0.0.2/demo-ns/busybox"
    command:
    - sleep
    - "3600"
    imagePullPolicy: Always
    name: busybox
  imagePullSecrets:
  - name: regcred
  restartPolicy: Always
And the final step – does it work? Can we now have a Pod on a TKG guest cluster pull a container image from the embedded Harbor Image Registry on vSphere with Kubernetes?
$ kubectl apply -f busybox-cor.yaml
pod/ch-busybox created

$ kubectl get pod
NAME         READY   STATUS    RESTARTS   AGE
ch-busybox   1/1     Running   0          5s

$ kubectl describe pod ch-busybox
Name:               ch-busybox
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               ch-tkg-cluster01-workers-2krnb-65fdb7455b-7v8wd/10.244.0.244
Start Time:         Tue, 23 Jun 2020 08:41:39 +0100
Labels:             app=ch-busybox
Annotations:        cni.projectcalico.org/podIP: 192.168.65.131/32
                    cni.projectcalico.org/podIPs: 192.168.65.131/32
                    kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"ch-busybox"},"name":"ch-busybox","namespace":"default"},"spe...
                    kubernetes.io/psp: vmware-system-privileged
Status:             Running
IP:                 192.168.65.131
Containers:
  busybox:
    Container ID:  docker://64155ea248f3d33c4afb7313e7cbd2819c00125c859cd8f435c4dc93094d67f5
    Image:         20.0.0.2/demo-ns/busybox
    Image ID:      docker-pullable://20.0.0.2/demo-ns/busybox@sha256:d2af0ba9eb4c9ec7b138f3989d9bb0c9651c92831465eae281430e2b254afe0d
    Port:          <none>
    Host Port:     <none>
    Command:
      sleep
      3600
    State:          Running
      Started:      Tue, 23 Jun 2020 08:41:41 +0100
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zv58f (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-zv58f:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-zv58f
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age        From                                                      Message
  ----    ------     ----       ----                                                      -------
  Normal  Scheduled  <unknown>  default-scheduler                                         Successfully assigned default/ch-busybox to ch-tkg-cluster01-workers-2krnb-65fdb7455b-7v8wd
  Normal  Pulling    <invalid>  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-7v8wd  Pulling image "20.0.0.2/demo-ns/busybox"
  Normal  Pulled     <invalid>  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-7v8wd  Successfully pulled image "20.0.0.2/demo-ns/busybox"
  Normal  Created    <invalid>  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-7v8wd  Created container busybox
  Normal  Started    <invalid>  kubelet, ch-tkg-cluster01-workers-2krnb-65fdb7455b-7v8wd  Started container busybox
Success! We have successfully pulled an image for our Pod running on a TKG (guest) cluster from the Harbor Image Registry integrated in vSphere with Kubernetes.
Now, at the risk of repeating myself, this is not supported. There are a number of life-cycle management activities that we need to work through before we can support it, and I expect we will also make the integration easier than what I have shown you here. However, if you keep this in mind, and are only interested in doing some testing or a proof of concept with TKG clusters in vSphere with Kubernetes, then this procedure should help.
Finally, a word of thanks to Ross Kukulinski who gave me a bunch of pointers when I got stuck (which happened quite a lot during this exercise).