VSAN Part 9 – Host Failure Scenarios & vSphere HA Interop

In this next post, I will examine some failure scenarios. I will concentrate on ESXi host failures, but suffice to say that a disk or network failure can also have consequences for virtual machines running on VSAN. There are two host failure scenarios highlighted below which can impact a virtual machine running on VSAN:

  1. An ESXi host, on which the VM is not running but has some of its storage objects, suffers a failure
  2. An ESXi host, on which the VM is running, suffers a failure

Let’s look at these failures in more detail.

Let’s take the simplest configuration: a 3-node cluster with a VM deployed with the default policy of ‘FailuresToTolerate = 1’.

In the first failure scenario, assume that one ESXi host (node) in the VSAN cluster suffers a crash/failure. The ESXi host on which the virtual machine is running is unaffected, therefore the VM itself continues to run. Well, what if the ESXi host that failed held some of the VM’s storage objects? Not an issue, since there will be a full copy of the VM’s storage objects available elsewhere in the cluster. This is because all VMs deployed on VSAN have ‘FailuresToTolerate = 1’ by default, meaning the VM can tolerate one (host/disk/network) failure by virtue of having a mirror (RAID-1) configuration.

A reconstruction of the replica storage objects that resided on the failed node is started after a timeout period of 60 minutes (this will allow enough time for host reboots, short periods of maintenance, etc). Once the reconstruction of the storage objects is completed, the cluster directory service is updated with information about where the VM’s storage objects reside in the cluster. There is a video of this behaviour here.
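The timeout logic described above can be sketched in a few lines of Python. This is only an illustration of the decision, not VSAN's actual implementation: the 60 minute value comes from the text, while the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

# The 60 minute delay is taken from the post; everything else is illustrative.
REBUILD_DELAY = timedelta(minutes=60)


def should_start_rebuild(failed_at, now, host_back_online):
    """Return True once the delay has expired and the host is still absent."""
    if host_back_online:
        # The host rebooted or left maintenance in time; no rebuild is needed.
        return False
    return now - failed_at >= REBUILD_DELAY


failed_at = datetime(2013, 9, 1, 12, 0)
print(should_start_rebuild(failed_at, failed_at + timedelta(minutes=10), False))
print(should_start_rebuild(failed_at, failed_at + timedelta(minutes=60), False))
```

A host that comes back within the window never triggers a rebuild in this sketch, which is the point of waiting: a short reboot should not cause a full copy of the data to be recreated elsewhere.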

If we look at the second scenario, where the virtual machine was running on the failing ESXi host, then vSphere HA (which inter-operates with VSAN) will restart the virtual machine on a remaining host in the VSAN cluster if configured to do so. Even if the ESXi host which failed also contained storage object replicas, there should still be at least one full mirror/replica of the virtual machine’s storage objects in the cluster. Again, as before, a reconstruction of the storage objects that used to reside on the failed node is started after a timeout period. There is another video of this behaviour here.
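To make the two scenarios concrete, here is a small Python sketch that classifies what happens to a VM when a host fails. It is a simplified model under stated assumptions: it only considers which hosts hold full replicas, it ignores any quorum rules VSAN applies internally, and the function name and host names are hypothetical.

```python
def impact_of_host_failure(failed_host, vm_host, replica_hosts, ha_enabled=True):
    """Return (vm_state, replicas_lost) for a single host failure.

    Simplified model: the VM's data stays available as long as at least
    one full replica lives on a surviving host.
    """
    surviving = [h for h in replica_hosts if h != failed_host]
    replicas_lost = len(replica_hosts) - len(surviving)

    if not surviving:
        vm_state = "storage unavailable"
    elif failed_host != vm_host:
        # Scenario 1: the VM's own host is fine, so it keeps running.
        vm_state = "keeps running"
    elif ha_enabled:
        # Scenario 2: the VM's host failed; vSphere HA restarts it elsewhere.
        vm_state = "restarted by vSphere HA"
    else:
        vm_state = "powered off until restarted manually"
    return vm_state, replicas_lost


# Scenario 1: VM runs on esx1, a host holding one of its replicas fails.
print(impact_of_host_failure("esx3", "esx1", ["esx2", "esx3"]))
# Scenario 2: the host running the VM (and holding a replica) fails.
print(impact_of_host_failure("esx1", "esx1", ["esx1", "esx2"]))
```

In both cases one replica is lost, and it is that replica which would be rebuilt once the timeout expires.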

Resynchronization Behaviour

VSAN maintains a bitmap of changed blocks in the event of components of an object being unable to synchronize due to a failure of a host, network or disk. This allows updates to VSAN objects composed of two or more components to be reconciled after a failure.

For example, in a distributed RAID-1 (mirrored) configuration, if a write is sent to nodes A and B for object X, but only A records the write before a cluster-wide power failure, on recovery, A and B will compare their logs for X and A will deliver its copy of the write to B.
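The A and B example can be sketched as a toy changed-block tracker in Python. This is purely illustrative and not how VSAN stores its bitmap: the `Replica` class, the block count and the `resync` helper are all hypothetical names, and a set of block indexes stands in for the bitmap.

```python
BLOCK_COUNT = 8  # arbitrary size for the illustration


class Replica:
    """One mirror copy of object X, with a record of unsynchronized blocks."""

    def __init__(self, name):
        self.name = name
        self.blocks = [b""] * BLOCK_COUNT
        self.dirty = set()  # stands in for the changed-block bitmap

    def write(self, index, data, peer_acknowledged):
        self.blocks[index] = data
        if not peer_acknowledged:
            # The peer missed this write, so remember the block for later.
            self.dirty.add(index)


def resync(source, target):
    """Copy only the blocks the target missed; return the copied indexes."""
    copied = sorted(source.dirty)
    for index in copied:
        target.blocks[index] = source.blocks[index]
    source.dirty.clear()
    return copied


a, b = Replica("A"), Replica("B")
a.write(3, b"new data", peer_acknowledged=False)  # only A recorded the write
print(resync(a, b))  # indexes delivered from A to B
```

Because only the marked blocks are copied, recovery after a short outage transfers far less data than rebuilding the whole replica.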

vSphere HA Interoperability

vSphere HA is fully supported on a VSAN cluster to provide additional availability to virtual machines deployed in the cluster. However, a number of significant changes have been made to vSphere HA to ensure correct interoperability with VSAN. Notably, vSphere HA agents communicate over the VSAN network when the hosts participate in a VSAN cluster. The reasoning behind this is that VMware wants HA and VSAN nodes to be part of the same partition in the event of a network failure; this avoids the conflicts that would arise if HA and VSAN were partitioned differently, with different partitions laying claim to the same object. However, vSphere HA continues to use the management network’s gateway for isolation detection.

Another noticeable difference with vSphere HA on VSAN is that the VSAN datastore cannot be used for datastore heartbeats. These heartbeats play a significant role in determining virtual machine ownership in the event of a vSphere HA cluster partition event. This means that if partitioning occurs, vSphere HA cannot use datastore heartbeats to determine whether another partition can power on the virtual machines before this partition powers them off. This feature is very advantageous to vSphere HA when deployed on shared storage, as it allows some level of coordination between partitions. It is not available to VSAN deployments since there is no shared storage. If a VSAN cluster partitions, there is no way for hosts in one partition to access the local storage of hosts on the other side of the partition; thus there is no use for vSphere HA heartbeat datastores.

Note however that if VSAN hosts also have access to shared storage, either VMFS or NFS, then these datastores may still be used for vSphere HA heartbeats.
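As a rough illustration of that rule, the following Python sketch filters a list of datastores down to those that could serve as heartbeat datastores. The datastore names and type strings are made up for the example, and this is not a call into any vSphere API.

```python
# Datastore types that the text says can still carry vSphere HA heartbeats.
HEARTBEAT_CAPABLE_TYPES = {"VMFS", "NFS"}


def heartbeat_candidates(datastores):
    """Return the names of datastores usable for HA heartbeating.

    `datastores` is a list of (name, type) pairs; VSAN datastores are
    skipped because they cannot be used for datastore heartbeats.
    """
    return [name for name, ds_type in datastores
            if ds_type in HEARTBEAT_CAPABLE_TYPES]


mounted = [("vsanDatastore", "VSAN"), ("shared-vmfs-01", "VMFS"), ("nfs-iso", "NFS")]
print(heartbeat_candidates(mounted))
```

A cluster with only a VSAN datastore simply ends up with an empty candidate list, which matches the behaviour described above.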

vSphere HA needs to store the protection metadata for each virtual machine in the cluster. On traditional datastores, this was stored in a folder in the root directory of each datastore, labeled ‘.vSphere-HA’. In VSAN, this is done differently. Instead of storing it in the root directory of a datastore, the vSphere HA protection metadata is now stored in the virtual machine’s namespace directory, along with the virtual machine’s configuration files. You can learn more about VM object layout on VSAN by reading this article on objects and components.

Check out all my VSAN posts here.

13 comments
  1. Hey there 🙂

    The datastores for HA topic is very interesting – does it mean that it’s not possible to build a ‘datacentre in a box’ because of the requirement for external ‘heartbeat’ datastores?

    • There is no requirement for heartbeat datastores. The reason you do not have this functionality when you only have a VSAN datastore is because HA will use the VSAN network for heartbeating. So if a host is isolated from the VSAN network and cannot send heartbeats, it is safe to say that it will also not be able to update a heartbeat region remotely 🙂

      So this is not really that big of a constraint or an issue, it is just the nature of the beast.

    • Not at all – you can certainly set up your DC in a box. Adding to what Duncan states above, vSphere HA heartbeat datastores are useful in shared storage environments. For instance, if there was a cluster partition, a master would use the heartbeat datastores to determine if hosts in the other partition have failed, or are actually still running.

      Since there is no way for hosts on either side of a partitioned VSAN cluster to access the storage on the other side of the partition, VSAN hosts cannot make use of heartbeat datastores.

  2. Great article Cormac!

    I already hinted at this on twitter, but what to do if total disaster strikes? E.g. the customer was silly enough to set ‘FailuresToTolerate = 0’, or has a copy, but it is either corrupt or both copies failed (for example, they ignored the first failure for far too long).

    I know I know, it _should_ not happen, but life has taught me that what can happen, will happen.
    For whatever reason there will be people who neglect a good backup and they will want to recover their data. As the new filesystem for VSAN is a proprietary one, is there any chance on recovery using forensics? Or is a customer in this case at the end of the line?

    • This will be the same scenario as having a VM on local storage, and the local disk fails. Let’s hope the customer has a good backup. 🙂 I’m not even sure GSS would be able to help you there.

  3. Very clear explanation of failures of ESXi hosts, and nice to know how vSAN heals in case of host reboots.

    Will Update Manager be vSAN aware, so that updating a cluster is done with respect to vSAN healing times after each individual host has been updated?

    Will there be a post about how vCenter failures are handled? I’m especially interested in the scenario where the vCenter Server is completely lost, and has to be reinstalled from scratch.

    The above is very interesting in small 3 host setups where vCenter (Server Appliance) is moved to a vSAN volume after initial deployment on standard local storage.

    • I think your first question is related to rolling upgrades, right? In other words, can we place a host in maintenance mode, update it and join it back into the cluster (rinse and repeat)? I will have to get back to you on that. Since it’s still beta, upgrades are not something I’ve looked at.

      With regards to vCenter failures, there is nothing necessary in vCenter to keep VSAN running. HA can restart your vCenter VM on another host in the cluster if the host on which vCenter is running fails. That said, if you had to rebuild vCenter from scratch, the only thing that would be missing are your VM Storage Policies. However these are still associated with your VM Storage Objects on the VSAN datastore so can be retrieved and rebuilt. We will get this procedure documented – watch this space Ulrik.

    • Not really – when you think of VSAN, you can think of the whole cluster as being a hot-spare, so if any one component failed, the cluster will handle it (so long as FailuresToTolerate is at least set to 1).

  4. Great series Cormac. Is there any concept planned to allow logical grouping of hosts and forcing replicas to be spread between groups, a bit like anti-affinity? More than once in my career at different companies I have seen the loss of a single rack, usually due to AC leak or similar. I would want to be able to spread cluster nodes across at least two racks or rows, and let VSAN spread the replicas for resilience. Network traffic is of course a consideration when increasing the spread, but worth it for the additional resilience in some environments.

    • Yes – this is definitely something a lot of our customers are looking for. I’m not sure if we will have this in the 1.0 version, but it should be available soon after if not.
