I have been spending a lot of time recently on vSphere with Tanzu and NSX-T. One of the tasks that I want to do is perform a network trace from a pod running on a TKG worker node. This will be for a future post. However, before running the trace, I need to secure shell (ssh) onto a TKG worker node in order to run the traceroute. This is more challenging with NSX-T than with vSphere networking, because NSX-T places the nodes on “internal” network segments which sit behind a tier-1 and tier-0 gateway. The easiest way to ssh to the TKG nodes is to create a “jumpbox” podVM in the same namespace, exec onto the jumpbox and run the ssh to the worker nodes from there. In this post I will show how to do this, including how to establish trust between the podVM and TKG cluster using SSH private keys so that you don’t have to provide any passwords when you ssh to the TKG nodes.
The TKG cluster and the podVM share the same tier-1 router as they are deployed in the same namespace. Although the podVM is allocated an IP address from a different range to the TKG cluster, and is placed on a different network segment, it is still within the same logical boundary created by the vSphere namespace. Thus, once the podVM is up and running, we can use kubectl to exec commands onto it. From there, after establishing trust between the podVM and the TKG Cluster nodes, we can use the same methodology to ssh onto the TKG nodes and perform the network trace.
Retrieve the SSH secrets for the TKG cluster
The first step is to retrieve the ssh secrets of the TKG cluster. To do that, ensure that the context is set to the vSphere namespace where the TKG cluster is deployed (this is not the TKG cluster namespace). In this example, the vSphere namespace where my TKG cluster is deployed is called cormac-ns, and the TKG cluster is called tkg-cluster-v1-21-6. This is the same namespace where I will deploy the podVM.
    % kubectl config use-context cormac-ns
    Switched to context "cormac-ns".

    % kubectl get secret | grep ssh
    tkg-cluster-v1-21-6-ssh            kubernetes.io/ssh-auth   1   3d20h
    tkg-cluster-v1-21-6-ssh-password   Opaque                   1   3d20h
Although we do not need to retrieve the actual password, it can be retrieved if necessary. I’ve included the steps to do that. First, display the ssh-password secret in YAML form, retrieve the base64 encoded password and then decode it.
    % kubectl get secrets | grep ssh
    tkg-cluster-v1-21-6-ssh            kubernetes.io/ssh-auth   1   3d21h
    tkg-cluster-v1-21-6-ssh-password   Opaque                   1   3d21h

    % kubectl get secrets tkg-cluster-v1-21-6-ssh-password -o yaml
    apiVersion: v1
    data:
      ssh-passwordkey: VXJ2xxxxxxxx3WT0=
    kind: Secret
    metadata:
      creationTimestamp: "2022-05-05T11:53:05Z"
      name: tkg-cluster-v1-21-6-ssh-password
      namespace: cormac-ns
      ownerReferences:
      - apiVersion: run.tanzu.vmware.com/v1alpha2
        kind: TanzuKubernetesCluster
        name: tkg-cluster-v1-21-6
        uid: 24ef873e-9ea2-4006-9c88-5d8109999a8f
      resourceVersion: "7206726"
      selfLink: /api/v1/namespaces/cormac-ns/secrets/tkg-cluster-v1-21-6-ssh-password
      uid: 49bb4995-6e6a-4959-8fb4-7d4e7bd1f765
    type: Opaque

    % echo "VXJ2xxxxxxxxxxx3WT0=" | base64 --decode
    Urv1xxxxx33wY=%
This is the password for the user vmware-system-user. If we were not setting up private key authentication, this password could be provided when prompted for by the ssh session.
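The display, copy and decode steps above can also be collapsed into a single pipeline. What follows is only a minimal sketch: it assumes the secret keeps the ssh-passwordkey data field shown in the YAML output and that the context is still set to the vSphere namespace. The function names decode_b64 and get_ssh_password are my own, not part of any VMware tooling.

```shell
# Decode a base64 string read from stdin (same as the manual decode step).
decode_b64() {
  base64 --decode
}

# Hypothetical helper: print the decoded password held in the named secret.
# Assumes the data field is called ssh-passwordkey, as in the YAML output.
get_ssh_password() {
  kubectl get secret "$1" -o jsonpath='{.data.ssh-passwordkey}' | decode_b64
}

# Usage (not run here):
#   get_ssh_password tkg-cluster-v1-21-6-ssh-password
```

Keeping the decode step in its own function means it can be checked on any base64 string without touching the cluster.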
Create the jumpbox podVM
The next step is to create the podVM. The args section of the manifest might take some explanation. The manifest creates a pod from a Photon OS version 3.0 image and installs openssh-server using yum. The pod gets access to the TKG cluster’s ssh private key by mounting the secret which contains it as a volume at /root/ssh. Once the install is complete, the args create a .ssh directory in /root on the podVM, copy the private key from /root/ssh/ssh-privatekey to /root/.ssh/id_rsa, and restrict its permissions with chmod 600. The podVM now holds the cluster’s private key, which allows it to open a secure shell to the TKG cluster nodes without involving passwords.
Below is the manifest used to create the jumpbox. You will need to change the namespace (cormac-ns) and the secret name (tkg-cluster-v1-21-6-ssh) for your environment. The secret name is the name of the TKG cluster that you are trying to ssh to, followed by -ssh. This manifest also defaults to pulling the container image from Docker Hub. If you are pulling the image from some other registry, you will need to update the image to include the location.
    apiVersion: v1
    kind: Pod
    metadata:
      name: jumpbox
      namespace: cormac-ns
    spec:
      containers:
      - image: "photon:3.0"
        name: jumpbox
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "yum install -y openssh-server; mkdir /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;" ]
        volumeMounts:
        - mountPath: "/root/ssh"
          name: ssh-key
          readOnly: true
        resources:
          requests:
            memory: 2Gi
      volumes:
      - name: ssh-key
        secret:
          secretName: tkg-cluster-v1-21-6-ssh
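If you deploy jumpboxes in more than one namespace, it may be easier to generate the manifest than to edit it by hand. The sketch below prints the same manifest with the namespace and cluster name filled in from arguments. The function name render_jumpbox_manifest is hypothetical, and the sketch assumes the secret is always named after the cluster with -ssh appended, as it is in my environment.

```shell
# Hypothetical helper: print the jumpbox manifest for a given environment.
#   $1 = vSphere namespace
#   $2 = TKG cluster name (the secret is assumed to be "<cluster>-ssh")
render_jumpbox_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: jumpbox
  namespace: ${1}
spec:
  containers:
  - image: "photon:3.0"
    name: jumpbox
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "yum install -y openssh-server; mkdir /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;" ]
    volumeMounts:
    - mountPath: "/root/ssh"
      name: ssh-key
      readOnly: true
    resources:
      requests:
        memory: 2Gi
  volumes:
  - name: ssh-key
    secret:
      secretName: ${2}-ssh
EOF
}

# Usage (not run here):
#   render_jumpbox_manifest cormac-ns tkg-cluster-v1-21-6 > jumpbox.yaml
```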
We are now ready to create the podVM.
    % kubectl apply -f jumpbox.yaml
    pod/jumpbox created

    % kubectl get pods
    NAME      READY   STATUS    RESTARTS   AGE
    jumpbox   1/1     Running   0          3m11s
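One thing to be aware of: the pod reports Running as soon as the container starts, but the args only copy the private key after the yum install has finished. If you are scripting these steps, you may want to wait for the key to appear before attempting an ssh. Here is a rough sketch of how that could be done. The function name wait_for_jumpbox_key, the 300 second timeout and the 5 second polling interval are all my own assumptions.

```shell
# Hypothetical helper: wait until the jumpbox has copied the private key.
#   $1 = pod name
#   $2 = number of checks to make, 5 seconds apart (default 60)
wait_for_jumpbox_key() {
  pod="$1"
  tries="${2:-60}"

  # First wait for the pod itself to become Ready.
  kubectl wait --for=condition=Ready "pod/${pod}" --timeout=300s || return 1

  # Then poll for the key file that the args section creates.
  while [ "$tries" -gt 0 ]; do
    if kubectl exec "$pod" -- /bin/bash -c 'test -f /root/.ssh/id_rsa'; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 5
  done
  return 1
}

# Usage (not run here):
#   wait_for_jumpbox_key jumpbox && echo "jumpbox is ready for ssh"
```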
Success! The jumpbox podVM is up and running in the same namespace as the TKG cluster. It has also been allocated an IP address from the IP address pool that was created by NSX-T for this namespace.
Under the covers, it is attached to the same tier-1 gateway as the TKG cluster. We can verify this by viewing the tier-1 topology from NSX-T Manager. In the topology view, the jumpbox podVM is on the bottom left and the TKG cluster nodes are on the right. Note the different segments and the different IP address ranges. The Gateway Firewall Services and NAT Rules associated with the tier-1 gateway allow these network objects (podVM and TKG nodes) to communicate with one another.
We should now be able to ssh to the TKG worker nodes from the podVM without providing a password.
SSH to podVM and TKG worker node
There are a number of ways to do this. I am going to begin with a two-step approach: first, I will exec a bash shell session on the podVM, and from there I will ssh to the TKG cluster worker node. Before that, we need the IP addresses of the TKG nodes. Since the context is set to the vSphere namespace, we can get these from the kubectl get virtualmachines -o wide command. We will then use the worker’s IP address when we run the ssh command on the podVM, logging in as vmware-system-user.
    % kubectl get virtualmachines -o wide
    NAME                                                       POWERSTATE   CLASS              IMAGE                                                       PRIMARY-IP    AGE
    tkg-cluster-v1-21-6-control-plane-9fn5c                    poweredOn    guaranteed-small   ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a   10.244.0.37   2d23h
    tkg-cluster-v1-21-6-worker-pool-1-x2szc-7ffbddd567-8ct8m   poweredOn    guaranteed-small   ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a   10.244.0.38   2d23h
    tkg-cluster-v1-21-6-worker-pool-1-x2szc-7ffbddd567-pn7sp   poweredOn    guaranteed-small   ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a   10.244.0.39   2d23h

    % kubectl exec -it jumpbox -- bash
    root [ / ]#
    root [ / ]# ls /root/.ssh
    id_rsa  known_hosts
    root [ / ]# /usr/bin/ssh vmware-system-user@10.244.0.38
    The authenticity of host '10.244.0.38 (10.244.0.38)' can't be established.
    ECDSA key fingerprint is SHA256:k6yIaa69HW6CS6f5D4yh26rkjtHbmVlh5hTnyjFxfUY.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added '10.244.0.38' (ECDSA) to the list of known hosts.
    Welcome to Photon 3.0 (\m) - Kernel \r (\l)
     08:16:13 up 2 days, 23:45,  0 users,  load average: 0.64, 0.66, 0.64
    22 Security notice(s)
    Run 'tdnf updateinfo info' to see the details.
    ]$ are-system-user@tkg-cluster-v1-21-6-worker-pool-1-x2szc-7ffbddd567-8ct8m [ ~
And there we go. We are now ssh’ed onto a worker node. Note that there was no prompt for a password. We could also do the same thing as a single exec command rather than work interactively in the podVM.
    % kubectl exec -it jumpbox -- /usr/bin/ssh vmware-system-user@10.244.0.38
    Welcome to Photon 3.0 (\m) - Kernel \r (\l)
    Last login: Mon May 9 08:18:28 2022 from 10.244.0.18
     08:42:16 up 3 days, 11 min,  0 users,  load average: 0.72, 0.71, 0.66
    22 Security notice(s)
    Run 'tdnf updateinfo info' to see the details.
    <
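The single exec form also lends itself to running one command on every worker node. The sketch below is just one way of doing that. It assumes the six-column virtualmachines output shown earlier (with PRIMARY-IP in the fifth column), that worker VM names contain -worker-, and that each worker’s host key has already been accepted from the jumpbox, since there is no terminal to answer the yes/no prompt. The function names worker_ips and run_on_workers are hypothetical.

```shell
# Read `kubectl get virtualmachines -o wide` output on stdin and print the
# PRIMARY-IP (fifth column) of every VM whose name contains "-worker-".
worker_ips() {
  awk 'NR > 1 && $1 ~ /-worker-/ { print $5 }'
}

# Hypothetical helper: run one command on each worker node via the jumpbox.
# Assumes the worker host keys are already in the jumpbox's known_hosts.
run_on_workers() {
  for ip in $(kubectl get virtualmachines -o wide | worker_ips); do
    echo "== ${ip} =="
    kubectl exec jumpbox -- /usr/bin/ssh "vmware-system-user@${ip}" "$@"
  done
}

# Usage (not run here):
#   run_on_workers uptime
```

Parsing the table with awk is fragile if the column layout changes, so treat worker_ips as a convenience and not as something to build automation on.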
You might have noticed that the shell prompt on the TKG worker node is a little jumbled in each session. It seems to be something to do with the attempt to set a smart prompt. To avoid this, I set the environment variable for the prompt to something simple via the PS1="$ "; export PS1 command. That seems to tidy it up for me.
Now that I am able to access the TKG worker nodes, I can run my network traces from the pod and examine some of the networking configuration on the node. I’ll do another post on that shortly.
Finally, note that these steps are not necessary for vSphere with Tanzu using the NSX Advanced Load Balancer (NSX-ALB), since the TKG nodes will be deployed on a vSphere network in that case. You can just ssh directly to those nodes using the techniques outlined here without deploying a jumpbox podVM.
vSphere with Tanzu is available with NSX-T networking for both on-premises deployments as well as cloud deployments through Tanzu Services on VMware Cloud on AWS.