As part of a quick reference proof-of-concept/evaluation guide that I have been working on, it has become very clear that one of the areas that causes the most confusion is what happens when a storage device is either manually removed from a host participating in the Virtual SAN cluster or the device suffers a failure. These are not the same thing from a Virtual SAN perspective.
To explain the different behaviour, it is important to understand that Virtual SAN has two types of failure states for components: ABSENT and DEGRADED.
- A component is DEGRADED if Virtual SAN has detected a failure from which it believes the component will not return (e.g. one residing on a failed disk drive)
- A component is ABSENT if Virtual SAN has detected a failure, but Virtual SAN believes that the component may re-appear with all data intact (e.g. components residing on the disks of an ESXi host that has been rebooted)
An ABSENT state reflects a transient situation that may or may not resolve itself over time, and a DEGRADED state is a permanent state. This diagram might help to explain the differences:
It is interesting to note the difference between a disk that has been hot unplugged and a disk that has actually failed, as shown above. Since the disk may be reinserted, Virtual SAN treats the components on this disk as ABSENT. If a disk has a permanent failure, the components are marked as DEGRADED. This is important because there are many folks who feel that unplugging a drive from Virtual SAN should trigger immediate rebuilding of components – THIS IS NOT THE CASE. Virtual SAN has been designed so that if the administrator unplugs the incorrect device, or a host is inadvertently rebooted (either in error or due to a power outage) then an expensive rebuild operation is not immediately initiated. This gives administrators sufficient time to rectify the initial fault/condition.
When a component is marked as ABSENT, Virtual SAN will wait for 60 minutes (by default) for the component to become available once again. If it does not become available before the timer expires, Virtual SAN will begin rebuilding the component elsewhere in the cluster.
DEGRADED components are rebuilt immediately.
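To make the two behaviours concrete, here is a minimal Python sketch of the rebuild decision described above. It is only an illustration of the rule, not Virtual SAN's actual implementation: the names `ComponentState`, `should_rebuild` and `DEFAULT_REPAIR_DELAY_MINUTES` are hypothetical, and the `ACTIVE` state is added purely so the example has a healthy case to compare against.

```python
from enum import Enum


class ComponentState(Enum):
    """Hypothetical labels for the component states discussed in this post."""
    ACTIVE = "ACTIVE"      # healthy component (assumed name, for comparison only)
    ABSENT = "ABSENT"      # failure detected, component may come back intact
    DEGRADED = "DEGRADED"  # failure detected, component not expected to return


# Default wait before ABSENT components are rebuilt, as stated in the post.
DEFAULT_REPAIR_DELAY_MINUTES = 60


def should_rebuild(state, minutes_since_failure,
                   repair_delay_minutes=DEFAULT_REPAIR_DELAY_MINUTES):
    """Return True if a rebuild would start under the rules described above."""
    if state is ComponentState.DEGRADED:
        # Permanent failure: rebuild immediately, no timer involved.
        return True
    if state is ComponentState.ABSENT:
        # Transient failure: rebuild only once the delay has expired.
        return minutes_since_failure >= repair_delay_minutes
    # Healthy components never need rebuilding.
    return False


# A hot-unplugged disk 30 minutes ago: still waiting for it to return.
print(should_rebuild(ComponentState.ABSENT, 30))
# A failed disk: rebuild starts straight away.
print(should_rebuild(ComponentState.DEGRADED, 0))
```

The point of the sketch is that the timer only ever applies to ABSENT components; a DEGRADED component bypasses it entirely.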
Note that both ABSENT and DEGRADED are treated as a single failure by Virtual SAN; neither has any more impact than the other when it comes to number-of-failures-to-tolerate on the cluster. If a VM is configured to tolerate a single failure, the VM remains available in the event of a component going DEGRADED or ABSENT.
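The counting rule can be sketched in a few lines of Python. This is a deliberately simplified model with hypothetical function names: it assumes each entry in the list represents an independent failure domain holding a copy of the object, and it ignores how Virtual SAN actually places replicas and witnesses or decides quorum.

```python
# States that count as a failure; both carry exactly the same weight.
FAILED_STATES = {"ABSENT", "DEGRADED"}


def failures_counted(component_states):
    """Count failures, treating ABSENT and DEGRADED identically."""
    return sum(1 for state in component_states if state in FAILED_STATES)


def remains_available(component_states, failures_to_tolerate):
    """Simplified check: the VM stays available while the number of
    failures does not exceed its number-of-failures-to-tolerate setting."""
    return failures_counted(component_states) <= failures_to_tolerate


# One ABSENT component with failures-to-tolerate = 1: still available.
print(remains_available(["ACTIVE", "ABSENT", "ACTIVE"], 1))
# One ABSENT plus one DEGRADED is two failures: exceeds the setting of 1.
print(remains_available(["ABSENT", "DEGRADED", "ACTIVE"], 1))
```

Swapping "ABSENT" for "DEGRADED" in either example gives the same answer, which is exactly the behaviour described above.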
If you wish to modify the time associated with the rebuilding of components, VMware has KB Article 2075456 detailing how to do it. While you might like to reduce this value for the purposes of evaluation and speeding up rebuild initiation, in production we would advise leaving the default value of 60 minutes.
If you are considering a Virtual SAN Proof Of Concept (POC), check out this recent document entitled Tips for a successful VMware Virtual SAN Evaluation.