I have been doing a bunch of stuff around disaster recovery (DR) recently, and my storage of choice at both the production site and the recovery site has been VSAN, VMware Virtual SAN. I have already done a number of tests already with products like vCenter Server, vCenter Operations Manager and NSX, our network virtualization product. Next up was VCO, our vCenter Orchestrator product. I set up vSphere Replication for my vCO servers (I deployed them in a HA configuration) and their associated SQL DB VM on Friday, but when I got in Monday morning, I could not log onto my vCenter. The problem was that my vCenter was running on VSAN (a bit of a chicken and egg type situation), so how do I troubleshoot this situation without my vCenter. And what was the actual problem? Was it a VSAN issue? This is what had to be done to resolve it.
First things first, I had to figure out if I had a VSAN/storage issue or a vCenter server issue. I logged onto the ESXi host that was running the vCenter VM. On looking at the error messages, I could see that vCenter had problems accessing it disks. This meant that it was time to turn our attention to VSAN.
I was fortunate in that I had a number of other vCenter servers around my lab, so I immediately added the 3 ESXi hosts that formed my VSAN cluster to a different vCenter. Before doing that I created a new cluster object in the new vCenter’s inventory and enabled VSAN, then added my hosts. Once I added the hosts to the VSAN cluster, I was in a position to examine the storage and try to figure out what was wrong with my original vCenter server (and my other VMs). What I noticed is that the vCenter server VM’s objects had a number of missing components – in fact I was missing a majority of components which meant that the VM could not come online. So what could be causing that? I then launched RVC, the Ruby vSphere Console, to try to figure out if VSAN just needed a ‘nudge’ to fix itself. To do this, we used the trusted RVC command vsan.check_state:
The command reported a whole bunch of components that were inaccessible, just like we’d observed in the UI. We tried the –refresh_state option to this command, just to see if it could give VSAN a nudge to figure out where its components were (in case we had a massive APD (all paths down) event or something similar over the weekend). We didn’t think it would make much difference, and we were correct.
Now that the cluster was recreated on the new vCenter, I went to the VSAN General tab, and under Network status, we saw the dreaded Misconfiguration detected message. So that explains why so many components were absent. The hosts in the cluster had difficulty in communicating with each other, and in fact the cluster had partitioned. Using esxcli vsan cluster list on the ESXi hosts, we saw that in fact none of the hosts were able to communicate, and that there was only a single host in each partition.
OK – time to figure out why the hosts were not able to communicate. The strange is that nothing had changed (to my knowledge) over the weekend. Our ops guys confirmed that there were no infrastructure changes either. So we started off by pinging each VSAN traffic VMkernel port on all the hosts using the ping -I command – all were good and responding. So the next possible cause seemed to be that this was a multicast issue. I then started to use some of the very useful multicast troubleshooting commands referenced by Christian Dickmann in his 2014 VMworld presentation (STO3098).
So I next examined the multicast setting used by my VSAN Cluster using the esxcli vsan network list command. Here I can see that the Agent Group Multicast address was 18.104.22.168 and the Master Group Multicast Address was 22.214.171.124. I then ran a tcpdump-uw command to monitor the heartbeats between the hosts on the agent group multicast address in the VSAN cluster. This uses UDP port 23451. The master should send this heartbeat every second and all other hosts in the VSAN cluster should see this. The output observed was something like this:
OK – we identified something weird here. 10.7.7.21 is the VSAN VMkernel network IP address of the host where I ran the command, but 10.7.7.11 and 10.7.7.14 were not in my VSAN cluster. These are the IP addresses of the VSAN network interface on ESXi hosts that are in a completely different VSAN cluster. And then we remembered KB article 2075451 which talks about setting up separate multicast addresses for VSAN cluster that might be on the same network. Could that be relevant? We set up our cluster on a new set of multicast IP addresses, so that the multicast addresses were unique to this cluster:
We couldn’t see any change initially, so we simply ran some ping commands once again to the VSAN VMkernel ports, and suddenly we could see the correct ESXi hosts in the tcpdump-uw output. We launched RVC once more, and ran the vsan.check_state command. This time everything looked good and we had no inaccessible issues:
So what was the root cause? We’re not sure. The environment was running like this for some time, and the clusters didn’t appear to have a problem. However we can only assume that having the same multicast addresses across both clusters on the same network led to the issue at hand. Once we separated the clusters to have their own unique multicast addresses, we were good to go.
Kudos again to my buddy, Paudie O’Riordan, for working with me on this issue.
Hopefully you will find those command useful when you have to troubleshoot VSAN network issues. I would also recommend watching Christian’s STO3098 session from VMworld for additional troubleshooting tips.