In a previous post, I discussed the difference between a component that is marked as ABSENT and a component that is marked as DEGRADED. In this post, I’m going to take this up a level and talk about objects, and how failures in the cluster can change the status of objects. In VSAN, an object is made up of one or more components. For instance, if you have a VM that you wish to tolerate a number of failures, or you wish to stripe a VMDK across multiple disks, then you will have multiple components making up the VSAN object. Read this article for a better understanding of objects and components. Object compliance status and object operational status are two distinct states that an object may have. Let’s look at them in more detail next.
Object compliance status
If a virtual machine object is configured in such a way as to meet its VM Storage Policy settings, it is said to be compliant. If, however, a failure in the VSAN cluster means the object can no longer meet the requirements placed in the VM Storage Policy, it is said to be not compliant. If you are unclear about what VM Storage Policies are in the context of VSAN, read this for more information.
In the screenshot shown below, only one issue has occurred in the cluster (let’s say it is a host reboot). This has caused a number of objects to enter a “Not Compliant” state. By looking closer at the objects in question in the Physical Disk Placement tab, we can see that one of the components of the VM Home Namespace object has entered an ABSENT state (discussed in the earlier part 30 post). However, the object’s Operational Status remains healthy, as it still has greater than 50% of its components available and one full mirror is still intact. In fact, in this case it is only the witness component that is affected.
This implies that the virtual machine remains available and accessible, and there is no disruption to the end-user. If the witness remains absent for longer than 60 minutes (by default), it will be rebuilt on another node in the cluster, provided there are sufficient resources available.
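The rebuild condition just described can be written down as a small predicate. This is only an illustrative sketch: the 60 minute default comes from the text above, but the function and parameter names are hypothetical and do not correspond to any VSAN API or setting name.

```python
def should_rebuild(absent_minutes: float,
                   resources_available: bool,
                   repair_delay_minutes: float = 60) -> bool:
    """Return True when an ABSENT component would be rebuilt elsewhere.

    Hypothetical model: the component must have been absent for longer
    than the repair delay (60 minutes by default, per the text), and the
    cluster must have enough resources to place a new copy.
    """
    waited_long_enough = absent_minutes > repair_delay_minutes
    return waited_long_enough and resources_available


# A host reboot that completes in 10 minutes does not trigger a rebuild.
print(should_rebuild(absent_minutes=10, resources_available=True))
# A witness absent for 90 minutes is rebuilt only if there is room for it.
print(should_rebuild(absent_minutes=90, resources_available=True))
print(should_rebuild(absent_minutes=90, resources_available=False))
```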
Object operational state
The operational status of an object can be healthy or unhealthy, depending on the type of failure and number of failures in the cluster. If a full mirror is still available, and more than 50% of the object’s components are available, the object’s Operational State is said to be healthy, as seen in the previous example. If no full mirror is available, or less than 50% of the components are available (possibly due to multiple failures in the cluster when the VM is configured to tolerate 1 failure only), then the object’s operational state is said to be unhealthy.
If we take the example that there were two failures in the cluster, the Compliance Status remains at Not Compliant, but the Operational Status changes to Unhealthy. The reason for the unhealthy state is displayed. In this example, the reason for the “Unhealthy” Operational State is “disk unavailable” as shown below:
At this point, the operational state of the object is unhealthy because the object is no longer accessible. If an object has been set up to tolerate X failures in the cluster, and the cluster suffers more than X failures, then the object may become inaccessible. If in turn a VM becomes inaccessible, it means that at least one object of the VM is completely down (temporarily or permanently), so either there is no full mirror of the object (the failures have impacted both mirrors), or less than 50% of the components are available (the failures have impacted a mirror and witnesses).
However, it should be understood that this state is transient, not permanent. As soon as the underlying issue has been rectified, and a full mirror copy and more than 50% of an object’s components become available, the object will once again become accessible.
How does object operational state impact a VM’s accessibility?
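To make the two conditions concrete, here is a minimal sketch of the healthy/unhealthy rule as described above. The Component class, its fields, and the helper names are all hypothetical and exist only for illustration; this is not how VSAN represents objects internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    name: str
    kind: str        # "replica" or "witness" (hypothetical labels)
    mirror: str      # mirror the replica belongs to; "" for a witness
    available: bool


def is_healthy(components: List[Component]) -> bool:
    """Healthy = one full mirror intact AND more than 50% of components available."""
    available = sum(1 for c in components if c.available)
    more_than_half = available * 2 > len(components)

    mirrors = {c.mirror for c in components if c.kind == "replica"}
    full_mirror = any(
        all(c.available for c in components
            if c.kind == "replica" and c.mirror == m)
        for m in mirrors
    )
    return more_than_half and full_mirror


def ftt1_object(a: bool = True, b: bool = True, w: bool = True) -> List[Component]:
    """Two mirrors (one replica each) plus one witness."""
    return [
        Component("A", "replica", "mirror-1", a),
        Component("B", "replica", "mirror-2", b),
        Component("W", "witness", "", w),
    ]


# Only the witness is absent: still healthy (2 of 3, full mirror intact).
print(is_healthy(ftt1_object(w=False)))
# One replica and the witness are absent: unhealthy (only 1 of 3 available).
print(is_healthy(ftt1_object(a=False, w=False)))
```

Note that the first case is exactly the earlier screenshot example: the object is Not Compliant with its policy, yet its operational state stays healthy.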
Let’s elaborate a little on how availability of objects can impact the state of a VM.
(1) VM Home Namespace object goes inaccessible
If a running VM has its VM Home Namespace object go inaccessible due to failures in the cluster, a number of different things may happen. One of these includes the VM powering off. Once the VM is powered off, it will be marked “inaccessible” in the vSphere web client UI. There can also be other side effects, such as the VM getting renamed in the UI to its vmx path rather than VM name. This is visible in the screen shots above, as you can see the VM being renamed from vm1 in the first screen shot to its vmx path in the second screen shot.
(2) Virtual disk object goes inaccessible
If a running VM has one of its disk objects go inaccessible, the VM will keep running, but the VMDK object has all its I/O stalled. Typically, the Guest OS will eventually time out I/O. Some Windows versions may BSOD when this occurs. Many Linux distributions may downgrade the filesystem on the VMDK to read-only. The Guest OS behavior, and even the VM behavior, is not Virtual SAN specific. This is simply how ESXi behaves in an APD (All Paths Down) condition. Once the object becomes accessible again, the status should resolve, and things go back to normal. Of course, nothing ever gets corrupted.
How are multiple failures handled?
So how does VSAN avoid corruption when there are multiple failures in the cluster? The following scenario describes how this is handled. Let’s take a VM Storage Policy that configures objects with NumberOfFailuresToTolerate=1. These objects typically have 2 replicas (A, B) and 1 witness (W). Imagine the host containing replica A crashes. As discussed above, in order to restore I/O flow, Virtual SAN will use components B and W in the active set. Virtual SAN allows replica A to become “STALE”.
Now assume that the host holding the witness W also crashes. At this point, the object will lose availability as less than 50% of the components are available. This is expected. The user asked Virtual SAN to tolerate 1 concurrent failure, but 2 concurrent failures have now occurred on the cluster.
Now the host that contains replica A comes back online. You might think that since there are now 2 out of 3 components available (i.e. quorum) and a full replica of the data, the object should now be accessible. However, the object would still not be accessible, because replica A is STALE and is not up to date. Replica B has had a number of updates not yet synced to replica A.
Virtual SAN requires a quorum of up-to-date components, but here we have the following situation: A is present yet STALE, B is present and up to date, W is missing. So Virtual SAN only has 1 out of 3 up-to-date components and will not make the object accessible. This is a necessary precaution, as otherwise we could construct a scenario where Virtual SAN would incorrectly conclude an object is up to date when it is actually not. So in this scenario, even though A came back up, Virtual SAN would still consider this a double failure.
When the witness (W) returns, the object can become accessible again. This is because we now have two out of three up-to-date components forming a quorum. The up-to-date replica (B) can be synced to the STALE replica (A), and when the sync completes, that component will be placed in the ACTIVE state.
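The whole sequence can be replayed as a small simulation. This is a sketch under stated assumptions, not Virtual SAN’s actual algorithm: each component is modeled with a single state string (ACTIVE, ABSENT or STALE), only ACTIVE components count as up to date, and the object is treated as accessible when more than half of its components are up to date and at least one of them is a replica. The function and variable names are hypothetical.

```python
REPLICAS = {"A", "B"}  # W is the witness


def is_accessible(state: dict) -> bool:
    """Accessible = quorum of up-to-date components, including a full replica."""
    up_to_date = [name for name, s in state.items() if s == "ACTIVE"]
    has_quorum = len(up_to_date) * 2 > len(state)
    has_replica = any(name in REPLICAS for name in up_to_date)
    return has_quorum and has_replica


timeline = [
    ("all components up",     {"A": "ACTIVE", "B": "ACTIVE", "W": "ACTIVE"}),
    ("host with A crashes",   {"A": "ABSENT", "B": "ACTIVE", "W": "ACTIVE"}),
    ("host with W crashes",   {"A": "ABSENT", "B": "ACTIVE", "W": "ABSENT"}),
    ("A returns, but STALE",  {"A": "STALE",  "B": "ACTIVE", "W": "ABSENT"}),
    ("W returns",             {"A": "STALE",  "B": "ACTIVE", "W": "ACTIVE"}),
    ("A resynced from B",     {"A": "ACTIVE", "B": "ACTIVE", "W": "ACTIVE"}),
]

results = [(label, is_accessible(state)) for label, state in timeline]
for label, ok in results:
    print(f"{label:22s} -> {'accessible' if ok else 'NOT accessible'}")
```

The fourth step is the interesting one: two of three components are present, but only one is up to date, so the object stays inaccessible until the witness comes back.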