VSAN Troubleshooting Case Study

I have been doing a lot of work around disaster recovery (DR) recently, and my storage of choice at both the production site and the recovery site has been VSAN, VMware Virtual SAN. I have already run a number of tests with products like vCenter Server, vCenter Operations Manager and NSX, our network virtualization product. Next up was vCO, our vCenter Orchestrator product. On Friday I set up vSphere Replication for my vCO servers (I deployed them in an HA configuration) and their associated SQL DB VM, but when I got in on Monday morning, I could not log on to my vCenter. The complication was that my vCenter was running on VSAN (a bit of a chicken-and-egg situation), so how do I troubleshoot this without my vCenter? And what was the actual problem? Was it a VSAN issue? This is what had to be done to resolve it.

First things first, I had to figure out whether I had a VSAN/storage issue or a vCenter server issue. I logged on to the ESXi host that was running the vCenter VM. Looking at the error messages, I could see that vCenter had problems accessing its disks. This meant that it was time to turn my attention to VSAN.

I was fortunate in that I had a number of other vCenter servers around my lab, so I added the 3 ESXi hosts that formed my VSAN cluster to a different vCenter. Before doing that, I created a new cluster object in the new vCenter’s inventory and enabled VSAN on it, then added my hosts. Once the hosts were in the VSAN cluster, I was in a position to examine the storage and try to figure out what was wrong with my original vCenter server (and my other VMs). What I noticed was that the vCenter server VM’s objects had a number of missing components. In fact, a majority of the components were missing, which meant that the VM could not come online. So what could be causing that? I then launched RVC, the Ruby vSphere Console, to try to figure out if VSAN just needed a ‘nudge’ to fix itself. To do this, we used the trusted RVC command vsan.check_state.

The command reported a large number of inaccessible components, just as we’d observed in the UI. We tried the --refresh-state option to this command, just to see if it could give VSAN a nudge to figure out where its components were (in case we had had a massive APD (all paths down) event or something similar over the weekend). We didn’t think it would make much difference, and we were correct.

Now that the cluster was recreated on the new vCenter, I went to the VSAN General tab, and under Network status we saw the dreaded Misconfiguration detected message. That explained why so many components were absent. The hosts in the cluster had difficulty communicating with each other, and in fact the cluster had partitioned. Using esxcli vsan cluster get on the ESXi hosts, we saw that none of the hosts were able to communicate with each other, and that there was only a single host in each partition.
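The partition check above boils down to comparing what each host believes the cluster membership to be. A minimal sketch of that comparison follows, assuming the per-host membership has already been collected by hand; the host names and the `group_into_partitions` helper are hypothetical and are not part of any VMware tool.

```python
from collections import defaultdict


def group_into_partitions(member_views):
    """Group hosts by the set of cluster members each one reports seeing.

    Hosts that report an identical membership view are treated as being
    in the same partition.
    """
    partitions = defaultdict(set)
    for host, visible_members in member_views.items():
        partitions[frozenset(visible_members)].add(host)
    return sorted(sorted(hosts) for hosts in partitions.values())


# Hypothetical host names. Each host only sees itself, as in this incident.
views = {
    "esxi-a": {"esxi-a"},
    "esxi-b": {"esxi-b"},
    "esxi-c": {"esxi-c"},
}

partitions = group_into_partitions(views)
print(f"{len(partitions)} partition(s): {partitions}")
```

A healthy three-node cluster would produce a single partition containing all three hosts; three single-host partitions is the situation described here.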

OK, time to figure out why the hosts were not able to communicate. The strange thing is that nothing had changed (to my knowledge) over the weekend. Our ops guys confirmed that there were no infrastructure changes either. So we started off by pinging each VSAN traffic VMkernel port on all the hosts using the ping -I command. All were good and responding. The next possible cause seemed to be a multicast issue. I then started to use some of the very useful multicast troubleshooting commands referenced by Christian Dickmann in his 2014 VMworld presentation (STO3098).

Next I examined the multicast settings used by my VSAN cluster with the esxcli vsan network list command. Here I could see that the Agent Group Multicast Address was 224.2.3.4 and the Master Group Multicast Address was 224.1.2.3. I then ran a tcpdump-uw command to monitor the heartbeats between the hosts on the agent group multicast address in the VSAN cluster. This uses UDP port 23451. The master should send this heartbeat every second, and all other hosts in the VSAN cluster should see it. We then checked which source addresses the heartbeats in the capture were coming from.
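As a rough illustration of what such a capture needs to match, the sketch below builds a pcap-style filter expression from the agent group address and port quoted above. The `heartbeat_filter` helper is hypothetical, and this is not the exact tcpdump-uw invocation used during the investigation.

```python
import ipaddress

# Values quoted in the post for this cluster.
AGENT_GROUP = "224.2.3.4"
MASTER_GROUP = "224.1.2.3"
AGENT_PORT = 23451


def heartbeat_filter(group, port):
    """Build a pcap-style filter for UDP traffic sent to one multicast group."""
    addr = ipaddress.ip_address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    return f"dst host {addr} and udp port {port}"


print(heartbeat_filter(AGENT_GROUP, AGENT_PORT))
```

Validating the address first catches the easy mistake of pasting a VMkernel unicast address where a group address was intended.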

OK, we identified something weird here. 10.7.7.21 is the VSAN VMkernel network IP address of the host where I ran the command, but 10.7.7.11 and 10.7.7.14 were not in my VSAN cluster. These are the IP addresses of the VSAN network interfaces on ESXi hosts that are in a completely different VSAN cluster. And then we remembered KB article 2075451, which talks about setting up separate multicast addresses for VSAN clusters that might be on the same network. Could that be relevant? We reconfigured our cluster with a new set of multicast IP addresses, so that the multicast addresses were unique to this cluster.
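The check we did by eye on the capture can be sketched as a simple set difference between the heartbeat sources seen on the wire and the hosts that should be in the cluster. Only 10.7.7.21 is known to be a member from the text above; the other two member addresses and the `foreign_senders` helper are assumptions for illustration.

```python
import ipaddress


def foreign_senders(observed_sources, cluster_members):
    """Return heartbeat source IPs that do not belong to this cluster."""
    unexpected = set(observed_sources) - set(cluster_members)
    return sorted(unexpected, key=ipaddress.ip_address)


# 10.7.7.21 comes from the post; the other two members are hypothetical.
members = ["10.7.7.21", "10.7.7.22", "10.7.7.23"]

# Source addresses as seen in the capture described above.
observed = ["10.7.7.21", "10.7.7.11", "10.7.7.14", "10.7.7.21"]

print(foreign_senders(observed, members))
```

Any address in the result is a host from another cluster whose heartbeats are arriving on this cluster's multicast group.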

We couldn’t see any change initially, so we simply ran some ping commands once again to the VSAN VMkernel ports, and suddenly we could see the correct ESXi hosts in the tcpdump-uw output. We launched RVC once more and ran the vsan.check_state command. This time everything looked good and there were no inaccessible components.

Finally, we looked in the web client, and the Network status was now good.

And when I checked the components which made up the VMDKs for my vCenter server, these were all Active once again (no absent or inaccessible components).

I was then in a position to connect to my original vCenter server, disconnect the hosts from my temporary vCenter server and add them back to the original vCenter server. All was good once again.

So what was the root cause? We’re not sure. The environment had been running like this for some time, and the clusters didn’t appear to have a problem. However, we can only assume that having the same multicast addresses across both clusters on the same network led to the issue at hand. Once we gave each cluster its own unique multicast addresses, we were good to go.
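A check like the following could flag this kind of overlap before it bites. It is only a sketch: the cluster names, the VLAN ID and the replacement addresses are hypothetical, while 224.2.3.4 and 224.1.2.3 are the addresses quoted earlier in this post.

```python
from collections import defaultdict


def find_multicast_collisions(clusters):
    """Return (vlan, group address) pairs used by more than one cluster."""
    users = defaultdict(set)
    for name, cfg in clusters.items():
        for group in (cfg["agent_group"], cfg["master_group"]):
            users[(cfg["vlan"], group)].add(name)
    return {key: sorted(names) for key, names in users.items() if len(names) > 1}


# Hypothetical inventory: two clusters sharing one VLAN and the same groups.
before = {
    "cluster-1": {"vlan": 100, "agent_group": "224.2.3.4", "master_group": "224.1.2.3"},
    "cluster-2": {"vlan": 100, "agent_group": "224.2.3.4", "master_group": "224.1.2.3"},
}

# Same inventory after one cluster is moved to its own (hypothetical) groups.
after = dict(before)
after["cluster-2"] = {"vlan": 100, "agent_group": "224.2.3.5", "master_group": "224.1.2.4"}

print(find_multicast_collisions(before))
print(find_multicast_collisions(after))
```

Keying on the VLAN as well as the address reflects the point raised in the comments: clusters on separate VLANs can reuse the same multicast groups without seeing each other's traffic.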

Kudos again to my buddy, Paudie O’Riordan, for working with me on this issue.

Hopefully you will find these commands useful when you have to troubleshoot VSAN network issues. I would also recommend watching Christian’s STO3098 session from VMworld for additional troubleshooting tips.

If you are interested in getting started with VSAN, the Essential VSAN book is now available via both Amazon and Pearson.

9 Replies to “VSAN Troubleshooting Case Study”

John Kennedy says:

September 23, 2014 at 3:13 pm

What happens when the other cluster changes its multicast address and another collision occurs? Why not use VLANs? Easy for me to say, I guess; I’ve got Cisco UCS, where the server admin controls VLANs.
1. Cormac says:
  
  September 24, 2014 at 8:38 am
  
  Indeed – that would have solved it. The issue is that we deployed multiple clusters on the same VLAN, and forgot to make the multicast change. Once we made the change so that each cluster had its own unique multicast addresses, we were good to go.
JamesM says:

September 23, 2014 at 10:23 pm

Maybe I’m confused, but I remember your book mentioning that if you are running multiple VSAN clusters, you have to turn on IGMP snooping and set up multiple multicast groups so that the multicast traffic is separated. Is the problem you had here along the same lines as what I’m talking about?
1. Cormac says:
  
  September 24, 2014 at 8:39 am
  
  Exactly, except we forgot to set the unique multicast group addresses per cluster.
JamesM says:

September 24, 2014 at 8:23 pm

One last question… when you moved the ESXi hosts into the new cluster (non-VSAN enabled) for testing purposes, did you enable VSAN automatic disk mode or manual?
1. Cormac says:
  
  September 25, 2014 at 8:45 am
  
  We had it in manual, but I don’t believe this would matter. The hosts remain part of a VSAN Cluster, so the cluster simply reforms without anything needing to be done at the disk level.
Timo Sugliani says:

September 29, 2014 at 12:41 pm

Hi Cormac,

Cool post. Just for my own interest, do we plan to give the end user a choice between multicast and unicast in the future? (like a vSphere advanced setting, e.g. vsan.multicast 0/1?)

I think multicast is there to avoid network overhead, but I’m not sure how much network traffic this generates. I suppose unicast would produce a larger overhead, but still nothing compared to the actual data traffic (and in that sense acceptable).

Would be great having a comparison between both if this was actually implemented in the future.

Any ideas/insight ?
1. Cormac says:
  
  September 29, 2014 at 1:27 pm
  
  Erm, we can’t really talk about futures stuff Timo …
Published by Cormac
