Debunking some behavior “myths” in 3 node vSAN cluster
I recently noticed a blog post describing some very strange behaviors in 2-node and 3-node vSAN clusters. I was especially concerned to read that when they introduced a failure and then fixed that failure, they did not experience any auto-recovery. I have reached out to the authors of the post, just to check out some things such as version of vSAN, type of failure, etc. Unfortunately I haven’t had a response as yet, but I did feel compelled to put the record straight. In the following post, I am going to introduce a variety of operations and failures in my 3-node cluster, and show you exactly how things are supposed to behave. I will look at maintenance mode behavior, network failure and a flash device failure. Please forgive some of the red markings in the health check screenshots below. I am using our latest and greatest version of vSAN, which has a number of new health check features that have yet to be announced. I hope to be able to share more in the very near future. But I did want to get this post out as quickly as possible to address those issues.
Maintenance Mode
Let’s start with the easy one – placing a host into maintenance mode. With VMs deployed using the default FailuresToTolerate = 1 policy, 3 hosts are needed: a first copy of the data, a second copy of the data (RAID-1) and a witness component. So if we place one of these hosts into maintenance mode, we can no longer protect the VM against failures. Therefore, when placing a host into maintenance mode, one has to select the “Ensure accessibility” option, as there is no additional host (i.e. a fourth host) available in the cluster to re-protect the VM. While the host is in maintenance mode, the VM runs with 2 out of 3 components. The components on the host in maintenance mode go into what is known as Absent state. Let’s see what that looks like in the health check:
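For those who prefer the command line, the same operation can be driven with esxcli directly on the host. This is just a minimal sketch from my lab; I am assuming the --vsanmode option and its ensureObjectAccessibility value are present in your build, so double-check with esxcli system maintenanceMode set --help first:

# Enter maintenance mode, keeping vSAN objects accessible (no full data evacuation)
esxcli system maintenanceMode set -e true -m ensureObjectAccessibility

# Check the current maintenance mode state
esxcli system maintenanceMode get

# Exit maintenance mode when done
esxcli system maintenanceMode set -e false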
Reduced availability with no rebuild – delay timer implies that vSAN knows that some of the components of the objects are unavailable. However, the VM remains accessible as 2 out of 3 of the components that make up the object are still available. For the purposes of this test, I waited for the clomd timer to expire and observed that the component remained in Absent state. I then took the host out of maintenance mode. The components that were Absent became Active without any further intervention, i.e. this happened automatically.
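For reference, the clomd timer mentioned above is the repair delay, which defaults to 60 minutes and is exposed as the advanced setting VSAN.ClomRepairDelay. If you want to check it, or temporarily tune it for a test like this one, something along these lines should work (the 120-minute value is purely an example; any change should be made on every host in the cluster and requires a clomd restart to take effect):

# Query the current repair delay (value is in minutes)
esxcli system settings advanced list -o /VSAN/ClomRepairDelay

# Example only: raise the delay to 120 minutes, then restart clomd to pick it up
esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 120
/etc/init.d/clomd restart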
I asked our engineering team, and they reported that this is the expected behavior all the way back to vSAN 5.5. An Absent component should not go Degraded.
Network Failure
In this next test, I wanted to cause a network outage on one of the nodes.
Note that it is not enough to simply disable the vSAN network service, as this will leave the physical network in place, and features like HA will think everything is fine and dandy. If you disable the vSAN network on the host where a VM runs, you’ll get a VM-centric view of the cluster from that host. This will show you the component from the host on which the VM runs, but you will not see the other hosts or other components. This means the VM will be displayed with 2 Absent components, and since HA thinks everything is fine, it won’t take any action to restart the VM on another host. And since the VM does not have a quorum of components, it will be inaccessible.
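If you are not sure which VMkernel interface and physical uplinks are carrying vSAN traffic on the host, it is worth checking from the CLI before deciding what to fail. A quick sketch (the interface and switch names you see will of course depend on your own configuration):

# Which vmknic is tagged for vSAN traffic on this host?
esxcli vsan network list

# Map that vmknic back to its portgroup, vSwitch and physical uplinks
esxcli network ip interface list
esxcli network vswitch standard list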
To implement a proper network failure, disable the physical uplinks used by vSAN on the host, as per KB article 2006074. In a nutshell, use the command esxcli network nic down -n vmnicX (and esxcli network nic up -n vmnicX to fix the network afterwards). This once again led to one out of the three components of the VM’s objects going Absent. The health check displays the object with an Inaccessible state; however, it is just one of the components that is inaccessible. The VM itself, which still has 2 out of 3 components available, remains online and accessible.
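For completeness, here are the commands from the KB article as I ran them in my lab. The vmnic2 below is simply the uplink carrying vSAN traffic on my host; substitute your own:

# Take the vSAN uplink down to simulate the network failure
esxcli network nic down -n vmnic2

# Bring it back up afterwards to "fix" the network
esxcli network nic up -n vmnic2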
The components remained in this state even after the clomd timer expired. As before, a 3-node cluster cannot self-heal (resync/rebuild absent components to another node), so we simply remain in this state until the network is fixed on the broken node. And as before, as soon as I fixed the networking issue, the absent components automatically resynced without any additional intervention. All objects then became healthy once more.
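If you want to watch that resync happen, RVC’s vsan.resync_dashboard command is a handy way to do it. A sketch, assuming an RVC session connected to the vCenter Server managing the cluster (the datacenter and cluster names below are placeholders from my lab):

# From an RVC session on the vCenter Server
vsan.resync_dashboard /localhost/MyDC/computers/MyCluster

Re-run the command until the bytes-to-sync figure drains to zero.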
Cache Device Failure
This is a pretty serious test. What I am testing here is a permanent disk failure on the cache device of a disk group. Those of you familiar with vSAN will know that this sort of failure impacts the whole disk group. However, as part of a Proof-Of-Concept, we have the ability to inject a permanent disk failure on a device. Please refer to the vSAN Proof-Of-Concept guide for further guidance. Obviously, never try this on a production system. The first step is to identify the device, and then use the python script below to inject the error:
[root@esxi:/usr/lib/vmware/vsan/bin] python ./vsanDiskFaultInjection.pyc \
 -p -d naa.55cd2e404c31f86d
Injecting permanent error on device vmhba0:C0:T5:L0
vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x1
vsish -e set /storage/scsifw/paths/vmhba0:C0:T5:L0/injectError 0x03110300000002
[root@esxi-dell-e:/usr/lib/vmware/vsan/bin]
This injects a permanent error on the device with that NAA id. It causes the device to go immediately into a Degraded state, not an Absent state. This is where vSAN knows that this device is “kaput” and is not recovering. This is different to Absent, where the nature of the failure is such that the device could in fact recover at some point (e.g. a hot-unplugged disk, a rebooted host). Since vSAN has detected a permanent failure of the device, it places it into Degraded state.
Now let’s take a look at the state of the object from the health check. Since the component is degraded, there is no waiting around for the clomd timer to expire; vSAN attempts to remediate the issue immediately. However, once more, since this is a 3-node cluster, there is no fourth node to self-heal to. Therefore the object health is shown as Reduced availability with no rebuild:
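The same object health information can also be pulled from the command line on any host in the cluster, which is handy if vCenter itself is impacted. A sketch, assuming a build recent enough to include the esxcli vsan debug namespace (6.2 or later):

# Summarize object health across the cluster
esxcli vsan debug object health summary get

# Drill into individual objects and their component states (Active/Absent/Degraded)
esxcli vsan debug object list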
Now repairing this issue is not easy. The first step is to clear the permanent error.
[root@esxi:/usr/lib/vmware/vsan/bin] python ./vsanDiskFaultInjection.pyc \
 -c -d naa.55cd2e404c31f86d
Clearing errors on device vmhba0:C0:T5:L0
vsish -e set /storage/scsifw/paths/vmhba0:C0:T5:L0/injectError 0x00000
vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x00000
[root@esxi:/usr/lib/vmware/vsan/bin]
But since the disk has a permanent error, the only way to truly clear it is to remove the device and re-add it. Since this is the cache device for a disk group, that means removing the whole disk group and re-creating the same disk group – which is exactly what you would have to do if a cache device failed in production. Now, even after the disk group has been rebuilt, the object will still show up as degraded. You have the option of clicking “Repair Objects Immediately” in the health check, or you can simply wait for vSAN to realize that the disk issue has been addressed, after which it will repair the object automatically.
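If you prefer to remove and re-create the disk group from the command line rather than the UI, the esxcli vsan storage namespace covers it. A hedged sketch using the cache device NAA id from earlier; the capacity device id shown is purely a placeholder for whatever capacity disks are in your disk group:

# Removing the cache device removes the whole disk group
esxcli vsan storage remove -s naa.55cd2e404c31f86d

# Re-create the disk group: the cache device plus its capacity device(s)
# (naa.5000c5008f0e4b55 is a placeholder id for the capacity disk in my lab)
esxcli vsan storage add -s naa.55cd2e404c31f86d -d naa.5000c5008f0e4b55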
I hope this goes some way towards explaining the behavior of a 3-node cluster when it comes to certain operations and failures. Rest assured that vSAN should always attempt to automatically recover absent components once the initial issue is addressed. This should also drive home the benefit of a 4-node cluster, which allows vSAN to self-heal in the event of a failure.
Nice article! Maybe you could mention the advantage of two disk groups per host for the case where a cache device fails. We always use at least two disk groups per host for our VSAN clusters.
Cormac, if you want to discuss the findings, please email me at the address used for this reply. We never got any contact request from you via the blog.
I can assure you, the findings the original blog was based on are the result of investigating and remediating a customer failure (one of many failures we must remediate across our customer base whenever VSAN is involved), so we are simply documenting the observations. In this specific case, the three nodes formed a “management cluster” running a bootstrapped (vCenter on the vSAN cluster) VSAN 6.0 (RTM, no patches), each node with a single SSD and a single disk drive. We even spoke to the VSAN development team (Junchi) based out of China as they wanted to “whiteglove” the 6.0 Patch 1… in the end, we switched that management cluster to a separate NAS storage box and disabled VSAN at the customer’s request.
FYI, this was over 1 year ago when the issue occurred.
Hi Neil, thanks for getting back to me. I used the email address on the web-site/blog. I think it was info@vifx.co.nz.
Interesting that it was an RTM build, and not a GA build. Perhaps that was something to do with the behaviour you observed.
The point I wanted to make with this blog is that those situations are not the norm – and the most recent versions of vSAN do not show this behaviour.
Sorry, my mistake, it was using GA… but, the issue was also seen by this person: https://communities.vmware.com/thread/528880