I’ve been working on some Disaster Recovery (DR) scenarios recently with my good pal Paudie. Last month we looked at how we might be able to protect vCenter Operations Manager by using a vApp construct together with IP customization. After VMworld, we turned our attention to NSX, and how we might be able to implement a DR solution for it. This is still a work in progress, but we did learn some very useful NSX troubleshooting commands that I thought would be worth sharing with you.
1. ping ++netstack=vxlan
If you are going to create a logical network across multiple clusters and perhaps different VLANs, it’s probably a good idea to verify that you can successfully reach all of the VTEPs (VXLAN Tunnel End Points). The NSX Controllers tell each VTEP everything it needs to know to connect its physical ports to virtual networks. Use the ping ++netstack=vxlan command to check reachability. In my environment, vmk5 was the VMkernel NIC used for VXLAN encapsulation, so I needed to be able to ping hosts in another VLAN, and also the controllers, to ensure everything was working. The command esxcfg-vmknic -l will list the VMkernel ports on an ESXi host, along with their IP addresses. Note that you will also need to specify vxlan as the netstack, or else the ping won’t work (see below):
~ # ping -I vmk5 172.24.150.100
Unknown interface 'vmk5': Invalid argument
~ # ping ++netstack=vxlan -I vmk5 172.24.150.100
PING 172.24.150.100 (172.24.150.100): 56 data bytes
64 bytes from 172.24.150.100: icmp_seq=0 ttl=60 time=125.298 ms
64 bytes from 172.24.150.100: icmp_seq=1 ttl=60 time=125.208 ms
64 bytes from 172.24.150.100: icmp_seq=2 ttl=60 time=125.059 ms

--- 172.24.150.100 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 125.059/125.188/125.298 ms
~ #
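With more than a handful of VTEPs, the same check can be wrapped in a small loop. The following is only a sketch: the function name check_vteps is made up for illustration, and it assumes that ping on the ESXi shell exits with a non-zero status when no replies come back.

```shell
# Ping each VTEP address through the VXLAN netstack and report the result.
# Usage: check_vteps <vmknic> <ip> [<ip> ...]
# check_vteps is a hypothetical helper name, not an ESXi command.
check_vteps() {
    vmk="$1"
    shift
    failed=0
    for ip in "$@"; do
        # -c 3 sends three echo requests, as in the run shown above.
        # Assumption: ping exits non-zero when no reply is received.
        if ping ++netstack=vxlan -I "$vmk" -c 3 "$ip" >/dev/null 2>&1; then
            echo "OK   $ip"
        else
            echo "FAIL $ip"
            failed=1
        fi
    done
    return "$failed"
}
```

In my setup this would be called as check_vteps vmk5 172.24.150.100, followed by any other VTEP and controller addresses to test. The function returns non-zero if any address fails.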
2. Get controller status
Once you can successfully ping all the VTEPs and NSX controllers, check that the VTEP has established a connection to the controller. Running esxcli network vswitch dvs vmware vxlan network list --vds-name <vds-name> on the ESXi host shows the controller and whether the connection to it is up.
~ # esxcli network vswitch dvs vmware vxlan network list --vds-name mia
VXLAN ID  Multicast IP               Control Plane                        \
--------  -------------------------  -----------------------------------  \
    5001  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  \

Controller Connection  Port Count  MAC Entry Count  ARP Entry Count  MTEP Count
---------------------  ----------  ---------------  ---------------  ----------
172.24.150.106 (up)             2                0                0           0
~ #
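If you want to check this across several hosts, the controller column can be pulled out of that output with a short awk filter. This is a sketch under assumptions: controller_states is a hypothetical name, and the filter assumes each controller entry is printed as an IPv4 address followed by a state in parentheses, as in the output above.

```shell
# Read `esxcli network vswitch dvs vmware vxlan network list` output on stdin
# and print "<controller-ip> <state>" for each controller entry found.
# controller_states is a hypothetical helper name.
controller_states() {
    awk '{
        for (i = 1; i < NF; i++) {
            # A controller entry looks like: 172.24.150.106 (up)
            if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $(i + 1) ~ /^\(.+\)$/) {
                state = $(i + 1)
                gsub(/[()]/, "", state)   # strip the surrounding parentheses
                print $i, state
            }
        }
    }'
}
```

Usage would be along the lines of esxcli network vswitch dvs vmware vxlan network list --vds-name mia | controller_states, which prints one line per logical switch. Any state other than up is worth a closer look.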
3. Port 1234
Another useful check is to look at the network connections and make sure that the netcpa-worker on the ESXi host and the controllers are communicating over TCP port 1234. The following command will help you to do this. You need to see that each connection is ESTABLISHED. If the state is SYN_SENT, the host is trying to reach the controller but getting no response, which implies that there are additional communication issues to resolve.
~ # esxcli network ip connection list| grep tcp | grep 1234
tcp 0 0 172.25.133.17:41219 172.24.150.108:1234 ESTABLISHED ..
tcp 0 0 172.25.133.17:22981 172.24.150.106:1234 ESTABLISHED ..
~ #
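A plain grep 1234 will also match any line that merely contains those digits, such as a local port of 41234. A slightly stricter filter matches only on the remote port and reports the state. The sketch below assumes the column order shown above (protocol, receive queue, send queue, local address, foreign address, state); controller_sessions is a hypothetical name.

```shell
# Read `esxcli network ip connection list` output on stdin and print the
# controller endpoint and TCP state of each connection to remote port 1234.
# Returns non-zero if no such connection exists or any is not ESTABLISHED.
# controller_sessions is a hypothetical helper name.
controller_sessions() {
    awk '
        # $5 is the foreign address, $6 the TCP state
        $1 == "tcp" && $5 ~ /:1234$/ {
            print $5, $6
            seen = 1
            if ($6 != "ESTABLISHED") bad = 1
        }
        END { if (!seen || bad) exit 1 }
    '
}
```

It would be run as esxcli network ip connection list | controller_sessions, and the exit status makes it easy to use in a loop over hosts.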
4. Packet Traces with pktcap-uw
ESXi ships with a packet capture utility called pktcap-uw. This allows you to capture packet traces from switch ports and uplinks. When capturing traces from VXLAN, ensure that you specify the correct segment ID via --vxlan <segment id>. This will allow you to trace packets leaving one ESXi host (perhaps the host where the NSX Edge is providing a service like DHCP) and arriving at another ESXi host (perhaps the host where there is a VM that is trying to pick up an IP address via DHCP). The VXLAN ID can be found in the UI or from the esxcli command above. Later versions of Wireshark can also display VXLAN transport information.
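As a rough illustration, a capture on an uplink for a single segment might be wrapped like this. Treat it as a sketch: the uplink name vmnic0 and the output path in the usage line are assumptions about your environment, and both capture_vxlan and the PKTCAP variable are made-up conveniences, not part of pktcap-uw.

```shell
# Capture VXLAN traffic for one segment on a physical uplink.
# Usage: capture_vxlan <uplink> <segment-id> <outfile>
# capture_vxlan is a hypothetical helper name. PKTCAP is a hypothetical
# override hook (not a pktcap-uw feature) so that the command line can be
# previewed with PKTCAP=echo before running the capture for real.
capture_vxlan() {
    uplink="$1"
    segment="$2"
    outfile="$3"
    "${PKTCAP:-pktcap-uw}" --uplink "$uplink" --vxlan "$segment" -o "$outfile"
}
```

For example, capture_vxlan vmnic0 5001 /tmp/vxlan-5001.pcap on both the sending and the receiving host gives you two traces for segment 5001 that can be compared in Wireshark.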
This next tip is very useful if you have a situation where all of your controllers get deleted and need to be redeployed.
This one had us scratching our heads quite a bit. We found this during DR testing, when all controllers were on site A, and then we failed over and we started NSX Manager up on site B. Therefore we needed to deploy all our controllers again. This involved deleting the original controllers on site A.
Because of this, the location of all of our existing VXLAN information is lost. Now if we try to pick up a DHCP address from the NSX Edge, it won’t work. We need to get the VXLANs to re-register with the new controllers. A quick way to do this is to toggle the logical switch from unicast or hybrid mode (the modes where controllers are needed) to multicast mode (the mode where controllers aren’t needed), and then back again. You only need to be in multicast mode momentarily; flip the logical switch back to unicast/hybrid immediately afterwards. This re-registers all of the MAC addresses of your VMs on the controllers. You will need to add a multicast range to your Segment ID pool first though, or else you will get the error: “Unable to allocate an available resource”. Now DHCP (and other services) should start to work again.
Hopefully you will find some of these troubleshooting tips useful. Kudos to Emiliano Turra for assistance with much of this.