A closer look at vSphere with Tanzu networking with NSX-T

Cormac

2 years ago

This post builds on earlier work on vSphere with Tanzu and NSX-T. In previous posts, we saw how to set up NSX-T so that it can be used by vSphere with Tanzu. The steps to install NSX-T Manager and prepare ESXi hosts were covered in part 1. We saw how to set up an NSX-T Edge in part 2. Then in part 3, the steps to create a tier-0 gateway with BGP for dynamic routing were shown. Most recently, the various NSX-T objects and services that are configured when the Supervisor cluster is deployed were examined.

In this next post in the series, we are going to look at the communication between objects that are deployed in vSphere with Tanzu. To begin, we will create a standard Kubernetes pod on a TKG workload cluster that has been deployed in a vSphere Namespace. From there, a traceroute is run so we can see the various hops that a packet takes through the NSX-T platform to reach its destination. How this pod communicates with other objects within the same namespace is also looked at, as well as communication with a Load Balancer service in another vSphere Namespace, such as the embedded Harbor Image Registry. Note that each of these vSphere Namespaces is configured with the default settings to use NAT on the tier-1 gateway. Whilst routable pods have been optionally available for vSphere Namespaces since vSphere 7.0.3, we are not using them in this demonstration.

Topology Overview

Let’s begin with a view of the NSX-T network topology currently in place. At present, I have a single vSphere Namespace called cormac-ns. In it, there is a TKG cluster and a single podVM deployed. These share the same tier-1 gateway and associated services (Load Balancer, NAT and Firewall) provided by the NSX-T Edge Cluster. The tier-1 acts as a boundary for the vSphere Namespace. Each vSphere Namespace gets its own segment, and so does each TKG cluster. This is why you can see two segments in the same vSphere Namespace below: one segment for the podVM and one for the TKG cluster. The TKG cluster has 1 control plane node and 2 worker nodes. All workloads (podVMs, VMs, TKG clusters) use the same SNAT/No SNAT policies and firewall rules applied at the tier-1. This allows these objects to communicate with each other. We shall see this in action shortly.

The final part to mention is the tier-0 gateway. This provides the connection to the physical, external network. It can be implemented using static or dynamic routing. In my setup, I have used BGP to implement dynamic routing.

Network namespaces

For a network trace to make sense, we need to have some idea of how a container within a Kubernetes node is able to communicate with the outside world. To do that, we need to leave the world of NSX-T for a moment and enter the world of the Container Network Interface (CNI). Two CNIs are supported on TKG clusters – Calico and VMware’s own Antrea. For this TKG cluster in the cormac-ns vSphere Namespace, Antrea is used as the CNI.

Each time a pod is created on a Kubernetes worker node, a network namespace is also created on the node. A network namespace can be considered a logical copy of the network stack from the node. Thus, each network namespace gets its own range of IP addresses, network interfaces, routing tables, and so forth. Multiple containers within the same pod share the same network namespace.

In this example, I have created a simple busybox pod (single container) on one of my TKG worker nodes. To examine the networking details of a pod network namespace, the following command can be run after exec’ing a shell session onto the pod/container. You can see the container IP address (192.168.1.3) listed. 192.168.0.0/16 is the default pod network that the system creates when deploying a TKG cluster using the Antrea (or Calico) CNI plugins.

/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
      valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
      valid_lft forever preferred_lft forever
3: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue
    link/ether f6:38:e6:86:d7:69 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.3/24 brd 192.168.1.255 scope global eth0
      valid_lft forever preferred_lft forever
    inet6 fe80::f438:e6ff:fe86:d769/64 scope link
      valid_lft forever preferred_lft forever

The default gateway on the pod can also be queried. We will see how this relates to the Antrea CNI in a moment.

/ # netstat -rn
Kernel IP routing table
Destination    Gateway        Genmask        Flags  MSS Window  irtt Iface
0.0.0.0        192.168.1.1    0.0.0.0        UG       0 0          0 eth0
192.168.1.0    0.0.0.0        255.255.255.0  U        0 0          0 eth0
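
To pick the default gateway out of that routing table in a script, a small filter is enough. This is a minimal sketch that assumes the column layout shown above; the function name default_gw is made up for illustration.

```shell
# default_gw: read `netstat -rn` output on stdin and print the gateway of the
# default route (destination 0.0.0.0). The function name is illustrative only.
default_gw() {
  awk '$1 == "0.0.0.0" { print $2; exit }'
}

# Usage inside the pod (assumes the busybox netstat and awk applets):
#   netstat -rn | default_gw
```

Run inside the busybox pod, this should print the same gateway address that appears in the Gateway column of the default route.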

We must now switch from the pod to the TKG worker node. This will allow us to find the network namespace for the pod. Note the cni- prefixes in the output below, which indicate that the network namespaces are being handled by the CNI plugin, in this case Antrea.

$ ip netns list
cni-c45cf45b-cff1-f78e-0679-6b3ca1df357e (id: 1)
cni-5d4cf301-70dc-7f21-312c-518fe09d7085 (id: 0)

To find the most recently created network namespace, you can list them as follows to see the creation dates. One date should coincide with the pod creation time. If no new network namespaces have been created since the pod was created, you may be on the wrong worker node. Check with kubectl describe pod to be sure.

$ ls -lt /var/run/netns
total 0
-r--r--r-- 1 root root 0 May 10 10:47 cni-c45cf45b-cff1-f78e-0679-6b3ca1df357e
-r--r--r-- 1 root root 0 May 10 10:30 cni-5d4cf301-70dc-7f21-312c-518fe09d7085
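
Rather than reading through the full kubectl describe output, the node name and creation time can be requested directly with a JSONPath expression. The sketch below is one way to do it; the function name pod_placement is hypothetical, and it assumes kubectl is already pointed at the TKG cluster and at the namespace that holds the pod.

```shell
# pod_placement: print the node a pod was scheduled on and the pod's creation
# timestamp, separated by a tab. The function name is illustrative; the pod
# name is passed as $1 and the current kubectl context/namespace is assumed.
pod_placement() {
  kubectl get pod "$1" \
    -o jsonpath='{.spec.nodeName}{"\t"}{.metadata.creationTimestamp}{"\n"}'
}

# Usage:
#   pod_placement busybox1
```

Keep in mind that creationTimestamp is reported in UTC, while ls -lt on the node shows the node's local time, so allow for a timezone offset and a short scheduling delay when comparing the two.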

Identify the network namespace that corresponds to the pod. To examine its networking details, the following command can be run. You can see the container IP address (192.168.1.3) listed once again. Note how it corresponds to the IP address reported from within the container/pod in the previous step.

$ sudo ip netns exec cni-c45cf45b-cff1-f78e-0679-6b3ca1df357e ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
      valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
      valid_lft forever preferred_lft forever
3: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 6e:e6:16:77:94:a6 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.1.3/24 brd 192.168.1.255 scope global eth0
      valid_lft forever preferred_lft forever
    inet6 fe80::6ce6:16ff:fe77:94a6/64 scope link
      valid_lft forever preferred_lft forever
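
When a worker node hosts many pods, matching on creation dates gets tedious. An alternative is to search every network namespace for the pod IP. The following is a minimal sketch under a few assumptions: the helper name netns_with_ip is made up, sudo is available on the node, and the namespaces are visible to ip netns list as they are above.

```shell
# netns_with_ip: print every network namespace on this node that holds the
# given IPv4 address. The function name is illustrative only.
netns_with_ip() {
  target=$1
  ip netns list | awk '{ print $1 }' | while read -r ns; do
    # </dev/null stops the inner command from consuming the namespace list.
    if sudo ip netns exec "$ns" ip -o -4 addr show </dev/null \
        | grep -qF " ${target}/"; then
      printf '%s\n' "$ns"
    fi
  done
}

# Usage on the worker node:
#   netns_with_ip 192.168.1.3
```

The match on a leading space and trailing slash keeps 192.168.1.3 from also matching addresses such as 192.168.1.30.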

Note that this is the pod’s end of the link. The link from the pod is connected to an Open vSwitch (OVS) bridge created by the Antrea CNI. To find the other end of the link, the end that is connected to the OVS bridge, we can grep for the 7 taken from the interface name eth0@if7. The link-netns field in the output shows the network namespace that holds the other end of the pair, which is the pod’s network namespace. This is the veth pair connecting the pod to the OVS bridge. The CNI (in this case the Antrea Agent) creates a veth pair for each pod, with one end in the pod’s network namespace and the other connected to the OVS bridge.

$ ip link | grep -A1 ^7
7: busybox1-9bdd44@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP mode DEFAULT group default
    link/ether 7a:42:d4:b4:97:bd brd ff:ff:ff:ff:ff:ff link-netns cni-220f06fc-b042-c007-fe1a-9dcea9294ec4


$ ip a show busybox1-9bdd44
7: busybox1-9bdd44@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default
    link/ether 7a:42:d4:b4:97:bd brd ff:ff:ff:ff:ff:ff link-netns cni-220f06fc-b042-c007-fe1a-9dcea9294ec4
    inet6 fe80::7842:d4ff:feb4:97bd/64 scope link
      valid_lft forever preferred_lft forever
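
The lookup of the veth peer can be made a little more exact than grep ^7, which would also match an interface with index 70 or 700. Here is a small sketch that parses the @ifN suffix and then matches the index exactly; both function names are hypothetical.

```shell
# peer_ifindex: extract the peer interface index from a name such as
# "eth0@if7" (prints 7). A name without an "@if" suffix is printed unchanged.
peer_ifindex() {
  printf '%s\n' "${1##*@if}"
}

# ifname_by_index: read `ip -o link` output on stdin and print the name of the
# interface whose index equals $1, with any "@ifN" suffix removed.
ifname_by_index() {
  awk -F': ' -v idx="$1" '$1 == idx { sub(/@.*/, "", $2); print $2 }'
}

# Usage on the worker node:
#   ip -o link | ifname_by_index "$(peer_ifindex eth0@if7)"
```

From inside the pod, the same peer index can also be read from /sys/class/net/eth0/iflink, which avoids parsing the interface name at all.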

Other interesting interfaces on the worker node are eth0 (the external link) and the Antrea gateway. The eth0 interface gets its IP address from the segment range that has been allocated by NSX-T to the TKG cluster, whereas the Antrea gateway address matches the default gateway that we saw on the pod/container previously.

$ ip a show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 04:50:56:00:14:00 brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.35/28 brd 10.244.0.47 scope global eth0
      valid_lft forever preferred_lft forever
    inet6 fe80::650:56ff:fe00:7c01/64 scope link
      valid_lft forever preferred_lft forever


$ ip a show antrea-gw0
5: antrea-gw0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 82:0f:0d:0d:7f:70 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.1/24 brd 192.168.1.255 scope global antrea-gw0
      valid_lft forever preferred_lft forever
    inet6 fe80::800f:dff:fe0d:7f70/64 scope link
      valid_lft forever preferred_lft forever
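
The /28 on eth0 pins down the exact segment range that NSX-T allocated to the TKG cluster. As a worked example, the arithmetic can be sketched in plain shell; the function names are made up for illustration, and 64-bit shell arithmetic is assumed (as in bash, or dash on a 64-bit system).

```shell
# ip_to_int: convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# int_to_ip: convert an integer back to dotted-quad notation.
int_to_ip() {
  echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# subnet_info: for ADDRESS/PREFIX print "network first-host broadcast".
subnet_info() {
  addr=${1%/*}; prefix=${1#*/}
  mask=$(( (4294967295 << (32 - prefix)) & 4294967295 ))
  net=$(( $(ip_to_int "$addr") & mask ))
  bcast=$(( net | (mask ^ 4294967295) ))
  echo "$(int_to_ip "$net") $(int_to_ip $(( net + 1 ))) $(int_to_ip "$bcast")"
}

subnet_info 10.244.0.35/28   # prints: 10.244.0.32 10.244.0.33 10.244.0.47
```

The broadcast address agrees with the brd value in the eth0 output, and the first usable address in the range, 10.244.0.33, is the one that turns up as the second hop in the traceroutes that follow.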

I built a visualization to try to show how all of this ties together from within the context of a pod/container running on a worker node which is using the Antrea CNI:

I drew the antrea-tun0 interface as well, even though it is not being used. This tunnel would be used to allow pod-to-pod communication across TKG worker nodes in the same cluster.

Let’s now run a traceroute to google.com and see the hops. Note that I am only interested in seeing the packets reach the external physical router interface, so I have snipped the hops that are not relevant to either the CNI or NSX-T.

traceroute #1 – TKG pod to google.com

To run the trace, exec a shell onto the busybox pod (this is a standard K8s pod on the TKG workload cluster), and run the traceroute command from within the pod/container.

% kubectl exec -it busybox1 -- sh
/ # traceroute google.com
traceroute to google.com (142.250.189.238), 30 hops max, 46 byte packets
 1  192.168.1.1 (192.168.1.1)  0.004 ms  0.014 ms  0.003 ms
 2  10.244.0.33 (10.244.0.33)  0.252 ms  0.112 ms  0.059 ms
 3  100.64.136.0 (100.64.136.0)  0.286 ms  0.206 ms  0.165 ms
 4  192.168.105.253 (192.168.105.253)  0.737 ms  0.644 ms  0.491 ms
 5  .snip
 6  .snip
...
17  .snip
18  .snip
19  nuq04s39-in-f14.1e100.net (142.250.189.238)  28.721 ms  28.762 ms  28.760 ms

So the first hop from the pod is to the Antrea gateway, 192.168.1.1. The packets are then routed out on the worker node’s eth0 interface, 10.244.0.35.

Once the packet leaves the node via eth0, the next question is “what is hop 10.244.0.33”? This address comes from one of the ranges carved out of the 10.244.0.0/20 network which is configured when setting up vSphere with Tanzu. As mentioned, each vSphere Namespace gets its own tier-1, segment and IP address pool. Each TKG cluster also gets its own segment and IP address pool. The address here is the port on the tier-1 gateway for the cormac-ns namespace. It connects the segment used by the TKG cluster to the tier-1 gateway. This is the TKG cluster that has the worker node where the pod that ran the traceroute was created. 10.244.0.33 is thus the IP address representing the tier-1 gateway connection, and is where the packets hop to when they leave the eth0 interface on the worker node (IP address 10.244.0.35 as seen above). The tier-1 is created automatically when the namespace is created in vSphere with Tanzu, but this segment and its connection to the tier-1 are created when the TKG cluster is created.

The next hop is 100.64.136.0. As you can probably imagine at this point, this is the tier-0 gateway to which the tier-1 gateway connects. From there, the packets are routed out onto the physical infrastructure and find their way to their destination. Here is a fabric view from NSX-T Manager of one of the worker nodes, so we can see the tier-1/tier-0 connection, as well as the Edge cluster connections which make the connection to my physical, upstream router port. After that, it is up to the physical infrastructure to route the packets correctly.
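
To summarize the path, the hops from the trace above can be labelled with the role each one plays. This is only a sketch: the function name describe_hop is hypothetical, the addresses are specific to this lab, and the 100.64.* pattern assumes the tier-0/tier-1 links are numbered from the 100.64.0.0/16 range, which matches the addresses seen here.

```shell
# describe_hop: label a traceroute hop address with the role it plays in this
# lab. The addresses and ranges come from the trace above and are specific to
# this environment; adjust them for your own.
describe_hop() {
  case "$1" in
    192.168.1.1)  echo "Antrea gateway (antrea-gw0) on the worker node" ;;
    10.244.0.33)  echo "tier-1 gateway port on the TKG cluster segment" ;;
    100.64.*)     echo "tier-0 gateway side of the tier-0/tier-1 link" ;;
    *)            echo "beyond NSX-T (physical network)" ;;
  esac
}

for hop in 192.168.1.1 10.244.0.33 100.64.136.0 192.168.105.253; do
  printf '%-16s %s\n' "$hop" "$(describe_hop "$hop")"
done
```

The exact addresses are matched before the wildcard on purpose: in this lab the upstream router (192.168.105.253) and the pod network both start with 192.168, so a broad prefix match would mislabel it.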

traceroute #2 – TKG pod to podVM in the vSphere Namespace

In this example, a trace is run from the K8s container/pod running on the TKG cluster, but this time rather than trying to reach an external site, we are simply going to try to reach a podVM in the same vSphere Namespace. Note that the podVM is on a segment that is connected to the same tier-1 but it is on a different segment to the TKG cluster. As you can see, this traceroute is successful.

/ # traceroute 10.244.0.18
traceroute to 10.244.0.18 (10.244.0.18), 30 hops max, 46 byte packets
1 192.168.1.1 (192.168.1.1) 0.353 ms 0.004 ms 0.003 ms
2 10.244.0.33 (10.244.0.33) 0.002 ms 0.071 ms 0.050 ms
3 10.244.0.18 (10.244.0.18) 0.228 ms 0.214 ms 0.165 ms
/ #

As has already been mentioned, all workloads – whether podVMs, VMs, or TKG clusters – use the same SNAT/No SNAT policies and firewall rules applied at the tier-1. This allows these objects to communicate with each other via distributed routing.
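
A quick calculation shows that the worker node and the podVM really do sit on different segments. The sketch below assumes that every segment is a /28, as seen on the worker node’s eth0; the podVM segment size is an assumption, and the function name segment28 is made up.

```shell
# segment28: print the /28 segment that an IPv4 address falls in.
# Assumes every segment is a /28 (as on the worker node's eth0). With a /28,
# the network address depends only on the last octet, rounded down to a
# multiple of 16.
segment28() {
  last=${1##*.}
  echo "${1%.*}.$(( last / 16 * 16 ))/28"
}

segment28 10.244.0.35   # TKG worker node, prints 10.244.0.32/28
segment28 10.244.0.18   # podVM,           prints 10.244.0.16/28
```

Two different segments behind the same tier-1 port address family (10.244.0.x) is consistent with the trace: the packet goes up to the tier-1 at 10.244.0.33 and straight back down to the podVM, without touching the tier-0.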

traceroute #3 – TKG pod to LB service in different vSphere Namespace

In this third trace, an attempt will be made to reach a load balancer service in another vSphere Namespace. I have set up the embedded Harbor Image Registry service – something that is available in vSphere with Tanzu and NSX-T. This sets up Harbor in its own vSphere Namespace. Thus, this test should attempt to send packets to a different vSphere Namespace without needing to go outside the NSX-T platform.

/ # traceroute 10.203.182.134
traceroute to 10.203.182.134 (10.203.182.134), 30 hops max, 46 byte packets
1 192.168.1.1 (192.168.1.1) 0.004 ms 0.004 ms 0.003 ms
2 10.244.0.33 (10.244.0.33) 0.182 ms 0.075 ms 0.051 ms
3 100.64.144.0 (100.64.144.0) 0.154 ms 0.150 ms 0.140 ms
4 10.203.182.134 (10.203.182.134) 0.130 ms 0.188 ms 0.137 ms
/ #

Once more, the traceroute has completed successfully. This time we hit the tier-0 gateway, as represented by the 100.64.144.0 IP address. However, the packet was then routed to the Load Balancer service hosted on the tier-1 gateway associated with the Harbor Image Registry in a different vSphere Namespace. We can also think of this Load Balancer service as being provided by the NSX Edge. The packets left the cormac-ns namespace through its tier-1, reached the tier-0, and were routed to the Load Balancer service on Harbor’s tier-1 (at least, that is how I look at it). Note that the packets did not leave the NSX-T platform.

traceroute #4 – podVM to podVM in different vSphere Namespace

We shall do a final test, and that is to try to reach a podVM in a different vSphere Namespace. This should not be allowed, since the unique tier-1 gateways act as a boundary between vSphere Namespaces, and podVMs are not exposed outside of their own vSphere Namespace. And that is exactly what happens here:

root [ / ]# traceroute 10.244.0.67
traceroute to 10.244.0.67 (10.244.0.67), 30 hops max, 60 byte packets
 1 10.244.0.17 (10.244.0.17) 0.750 ms 0.714 ms 0.654 ms
 2 100.64.144.0 (100.64.144.0) 0.406 ms 0.408 ms 0.404 ms
 3 100.64.200.1 (100.64.200.1) 0.494 ms 0.476 ms 0.470 ms
 4 * * *
 5 * * *
 6 * * *
 7 * * *
 8 * * *
 9 *^C

The 100.64.200.1 IP address is the tier-1 gateway of the vSphere Namespace where the destination podVM resides, but access is not allowed, as expected.

Hopefully these examples provide you with some insights into how NSX-T networking and vSphere with Tanzu interoperate.