First things first, I had to figure out whether I had a VSAN/storage issue or a vCenter server issue. I logged onto the ESXi host that was running the vCenter VM. Looking at the error messages, I could see that vCenter had problems accessing its disks. This meant that it was time to turn our attention to VSAN.
I was fortunate in that I had a number of other vCenter servers around my lab, so I immediately added the 3 ESXi hosts that formed my VSAN cluster to a different vCenter. Before doing that, I created a new cluster object in the new vCenter’s inventory and enabled VSAN on it, then added my hosts. Once the hosts were in the VSAN cluster, I was in a position to examine the storage and try to figure out what was wrong with my original vCenter server (and my other VMs).
What I noticed was that the vCenter server VM’s objects had a number of missing components. In fact, a majority of the components were missing, which meant that the VM could not come online. So what could be causing that? I then launched RVC, the Ruby vSphere Console, to try to figure out if VSAN just needed a ‘nudge’ to fix itself. To do this, we used the trusted RVC command vsan.check_state.
Now that the cluster was recreated on the new vCenter, I went to the VSAN General tab, and under Network status, we saw the dreaded Misconfiguration detected message. So that explained why so many components were absent. The hosts in the cluster had difficulty communicating with each other, and in fact the cluster had partitioned. Using esxcli vsan cluster get on the ESXi hosts, we saw that none of the hosts were able to communicate, and that there was only a single host in each partition.
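To make the idea of a partition concrete, here is a minimal Python sketch. It assumes you have already collected, by hand, the set of member UUIDs that each host reports. The host names, UUIDs and the find_partitions helper are all hypothetical and are not part of any VMware tooling. Hosts that report the same membership are grouped into the same partition.

```python
from collections import defaultdict

def find_partitions(member_views):
    """Group hosts by the set of cluster members each one reports seeing."""
    partitions = defaultdict(list)
    for host, members in member_views.items():
        # Hosts that agree on the membership set belong to the same partition.
        partitions[frozenset(members)].append(host)
    return [sorted(hosts) for hosts in partitions.values()]

# Hypothetical data: each host only sees itself, as in the situation described.
views = {
    "esxi-01": {"uuid-a"},
    "esxi-02": {"uuid-b"},
    "esxi-03": {"uuid-c"},
}
partitions = find_partitions(views)
print(len(partitions), "partition(s):", partitions)
```

A healthy three-node cluster would produce a single group containing all three hosts. Three groups of one host each corresponds to what we saw here.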
OK – time to figure out why the hosts were not able to communicate. The strange thing is that nothing had changed (to my knowledge) over the weekend. Our ops guys confirmed that there were no infrastructure changes either. So we started off by pinging each VSAN traffic VMkernel port on all the hosts using the ping -I command, and all were good and responding. The next possible cause seemed to be a multicast issue. I then started to use some of the very useful multicast troubleshooting commands referenced by Christian Dickmann in his 2014 VMworld presentation (STO3098).
So I next examined the multicast settings used by my VSAN cluster using the esxcli vsan network list command. Here I could see that the Agent Group Multicast Address was 224.2.3.4 and the Master Group Multicast Address was 224.1.2.3. I then ran a tcpdump-uw command to monitor the heartbeats between the hosts on the agent group multicast address in the VSAN cluster. This uses UDP port 23451. The master should send this heartbeat every second, and all other hosts in the VSAN cluster should see it. The output observed was something like this:
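Separately from the capture itself, the one-second heartbeat expectation can be checked mechanically. The Python sketch below is only an illustration. It assumes you have pulled packet arrival times (in seconds) out of a capture yourself. The heartbeat_gaps helper, the tolerance value and the sample timestamps are made up for the example.

```python
def heartbeat_gaps(timestamps, interval=1.0, tolerance=0.5):
    """Return (previous, next) arrival pairs that are further apart than interval + tolerance."""
    ordered = sorted(timestamps)
    limit = interval + tolerance
    # Compare each arrival time with the one that follows it.
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]

# Hypothetical arrival times: heartbeats stop between t=2s and t=5s.
arrivals = [0.0, 1.0, 2.0, 5.0, 6.0]
gaps = heartbeat_gaps(arrivals)
print(gaps)
```

An empty result means a host saw the master's heartbeat at roughly the expected rate. Any gap on a non-master host is a hint that multicast traffic is not arriving reliably.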
So what was the root cause? We’re not sure. The environment had been running like this for some time, and the clusters didn’t appear to have a problem. However, we can only assume that having the same multicast addresses across both clusters on the same network led to the issue at hand. Once we gave each cluster its own unique multicast addresses, we were good to go.
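If you run several VSAN clusters on one network, a quick inventory check for overlapping multicast addresses might catch this earlier. The following Python sketch assumes you have recorded each cluster's agent and master group addresses by hand. The cluster names, the dictionary layout and the shared_multicast_groups helper are hypothetical, and the addresses are illustrative values only.

```python
import ipaddress
from collections import defaultdict

def shared_multicast_groups(clusters):
    """Return multicast addresses used by more than one cluster, with the clusters that use them."""
    users = defaultdict(set)
    for name, settings in clusters.items():
        for addr in (settings["agent"], settings["master"]):
            # Reject anything outside the multicast range (224.0.0.0/4 for IPv4).
            if not ipaddress.ip_address(addr).is_multicast:
                raise ValueError(f"{addr} is not a multicast address")
            users[addr].add(name)
    return {addr: sorted(names) for addr, names in users.items() if len(names) > 1}

# Hypothetical inventory: cluster-a and cluster-b share both addresses.
clusters = {
    "cluster-a": {"agent": "224.2.3.4", "master": "224.1.2.3"},
    "cluster-b": {"agent": "224.2.3.4", "master": "224.1.2.3"},
    "cluster-c": {"agent": "224.2.3.5", "master": "224.1.2.4"},
}
conflicts = shared_multicast_groups(clusters)
print(conflicts)
```

Any address that shows up in the result is used by more than one cluster and is a candidate for changing, which is what we ended up doing here.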
Kudos again to my buddy, Paudie O’Riordan, for working with me on this issue.
Hopefully you will find these commands useful when you have to troubleshoot VSAN network issues. I would also recommend watching Christian’s STO3098 session from VMworld for additional troubleshooting tips.