As part of the enhancements to Virtual SAN 6.1, stretched cluster support was announced. To provide availability for virtual machines in a VSAN Stretched Cluster, vSphere HA needs to be configured. This allows VMs to be restarted on the same site (with affinity rules) when there is a host failure, or restarted on the remote site when there is a complete site failure. However, certain settings must be configured in a specific way, and they are fundamental to achieving high availability in a VSAN stretched cluster. In this post, I will call out the VMware recommended settings, and I will also explain why we recommend that vSphere HA be configured in this way on a VSAN Stretched Cluster. By following this guidance, you can be sure that your virtual machines get restarted on the same site (maintaining read locality) when there is a component/host failure on one site. It will also ensure that the virtual machines fail over and restart on the remaining site in the event of a complete site failure.
vSphere HA should most definitely be turned on. This will make your virtual machines highly available in the VSAN stretched cluster. Host monitoring should also be enabled. This will allow hosts in the vSphere HA cluster to exchange heartbeats over the network, and ensure that all nodes continue to participate in the cluster, and are healthy.
Note: If there is a host failure, the virtual machines that reside on this host will be restarted on the remaining hosts in the cluster. However, the recommendation is to have vSphere HA restart the virtual machines on hosts on the same data site, and not have them restarted on the other data site. The reason for this is that if there were a failover to the other site, the virtual machines would read from the local copy of the data on the remote site and would have to rewarm their cache on that site, so there would be a temporary performance dip. We want to avoid this by keeping VMs on their selected site as much as possible, thus using the local copy of the data and the cache on that site. VM/Host affinity groups and rules should be created to achieve this, and these rules should be “soft”. Being “soft” or “should” rules means that all attempts are made to restart the VMs on the local site, but if that is not possible (e.g. full site failure), then the “soft” rules can be broken and the VMs may be restarted on the other site. I will delve into this in greater detail in a future post, as it plays an important role in a VSAN stretched cluster.
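The behaviour of a “should” rule can be illustrated with a small Python sketch. This is only a model of the placement preference described above, not how vSphere HA is implemented; the `Host` class, the `pick_restart_host` function and the host and site names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    site: str
    healthy: bool = True


def pick_restart_host(vm_site, hosts):
    """Pick a healthy host for a restart, preferring the VM's own site."""
    healthy = [h for h in hosts if h.healthy]
    same_site = [h for h in healthy if h.site == vm_site]
    if same_site:
        return same_site[0]  # "should" rule honoured: stay on the local site
    if healthy:
        return healthy[0]    # "should" rule broken: restart on the other site
    return None              # nothing left to restart on


# Hypothetical four-host stretched cluster, two hosts per data site.
hosts = [
    Host("esx-a1", "site-a"),
    Host("esx-a2", "site-a"),
    Host("esx-b1", "site-b"),
    Host("esx-b2", "site-b"),
]

hosts[0].healthy = False                        # single host failure on site A
print(pick_restart_host("site-a", hosts).name)  # esx-a2

hosts[1].healthy = False                        # now the whole of site A is down
print(pick_restart_host("site-a", hosts).name)  # esx-b1
```

A “must” rule would correspond to returning `None` as soon as `same_site` is empty, which is exactly the behaviour we do not want after a full site failure.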
Host Hardware Monitoring – VM Component Protection
VSAN does not support VMCP, VM Component Protection, at this time. Therefore this should be left unchecked.
Virtual Machine Monitoring monitors the heartbeats of the virtual machines, and restarts a virtual machine if its heartbeats are not received over a period of time. This setting is optional, and is left to the customer's discretion. VMware supports having this feature either enabled or disabled.
Failure conditions and VM response
This is where the host isolation response is placed. Consider a situation where a network failure results in a host being isolated from the rest of the cluster. What do you wish to happen to those virtual machines that are on the isolated host? The VMware recommendation, when using vSphere HA in a VSAN stretched cluster, is to have the VMs powered off and restarted.
We are recommending this setting as it will take care of restarting virtual machines on the same site should a host get isolated, but it will also take care of restarting virtual machines on the other site should a complete site get isolated. The remaining failure conditions and responses, which mostly deal with APD & PDL events, can be left at the default setting, which is disabled. However, additional advanced settings are needed to ensure that host isolation response works correctly, such as not using the default gateway as an isolation response IP address and choosing isolation response IP addresses local to each site. These are discussed shortly.
VMware supports an active/active VSAN stretched cluster configuration, in other words, running virtual machines at both data sites. Given this, we feel that admission control should be configured in such a way that will allow the complete workload to run on one remaining site if there is a complete site failure. With that in mind, the recommendation is to set admission control to a percentage value of 50%. This will leave 50% of the cluster’s CPU and Memory resources free, and should ensure that one data site can run all the virtual machines in the event of a complete failure of the other site.
Now, having made that recommendation, customers can obviously change this and consume more than 50% of the CPU & Memory resources and they will be fully supported in doing so. But keep in mind that if there is a site failure, not all virtual machines may be able to restart on the remaining site due to a lack of resources. This is a call that customers will have to make in their respective environments.
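To make the 50% reasoning concrete, here is a rough back-of-the-envelope check in Python. It assumes two equally sized data sites and treats the workload as a simple sum of CPU and memory demand; the capacity figures and function names are made up for illustration, and this is not how vSphere HA admission control calculates failover capacity internally.

```python
def site_can_absorb(site_capacity, total_demand):
    """True if a single surviving site can run the whole workload."""
    return all(total_demand[r] <= site_capacity[r] for r in total_demand)


# Hypothetical capacity of ONE data site; the other site is assumed identical.
site_capacity = {"cpu_ghz": 100.0, "mem_gb": 1024.0}
cluster_capacity = {r: 2 * v for r, v in site_capacity.items()}


def demand_at(fraction):
    """Workload that consumes the given fraction of total cluster resources."""
    return {r: fraction * v for r, v in cluster_capacity.items()}


print(site_can_absorb(site_capacity, demand_at(0.50)))  # True
print(site_can_absorb(site_capacity, demand_at(0.65)))  # False
```

At 50% consumption, the surviving site has just enough resources for everything. At 65%, some virtual machines would not be able to restart after a site failure, which is the trade-off described above.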
Datastore for heartbeating
VSAN does not support heartbeat datastore functionality, so this needs to be disabled. There is no disable button for heartbeat datastores, so if there are VMFS volumes or NFS volumes presented to the hosts in the cluster, these datastores may be automatically chosen as heartbeat datastores. This could result in unpredictable behaviour in the VSAN stretched cluster, especially when it comes to failover events. Therefore customers need to ensure that heartbeat datastores are not in use. If you have datastores presented to the hosts other than the VSAN datastore, you should select the “Use datastores only from the specified list” option, and then not select any datastores from the list.
If there are no other datastores, and only the VSAN datastore, then this is not a concern and can be left at the default.
There are a number of advanced options that need to be added to ensure that host isolation works correctly when vSphere HA is configured on a VSAN stretched cluster. In a VSAN stretched cluster, one of the isolation addresses should reside in the site 1 data center and the other should reside in the site 2 data center. This enables vSphere HA to validate complete network isolation in the case of a connection failure between sites. VMware recommends enabling host isolation response and specifying isolation response addresses that are on the VSAN network rather than the default gateway on the management network. Therefore the vSphere HA advanced setting das.usedefaultisolationaddress should be set to false.
As stated, VMware recommends specifying two isolation response addresses, and each of these addresses should be site specific. In other words, select one isolation response IP address from the preferred VSAN stretched cluster site and another from the secondary VSAN stretched cluster site. The vSphere HA advanced setting for the first isolation response IP address is das.isolationaddress0, and it should be set to an IP address on the VSAN network which resides on the first site. The vSphere HA advanced setting for the second isolation response IP address is das.isolationaddress1, and it should be set to an IP address on the VSAN network which resides on the second site.
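The three advanced settings can be captured as key/value pairs and sanity-checked before they are entered in the vSphere HA advanced options. The Python sketch below uses only the standard library; the helper names, the IP addresses and the per-site VSAN subnets are hypothetical placeholders, so substitute the values from your own environment.

```python
import ipaddress


def build_ha_advanced_options(site1_addr, site2_addr):
    """Return the vSphere HA advanced options discussed above."""
    return {
        "das.usedefaultisolationaddress": "false",
        "das.isolationaddress0": site1_addr,
        "das.isolationaddress1": site2_addr,
    }


def check_isolation_addresses(options, site1_net, site2_net):
    """Return a list of problems; an empty list means the options look sane."""
    problems = []
    default_gw = options.get("das.usedefaultisolationaddress", "true")
    if default_gw.lower() != "false":
        problems.append("das.usedefaultisolationaddress should be false")
    expected = (
        ("das.isolationaddress0", site1_net),
        ("das.isolationaddress1", site2_net),
    )
    for key, net in expected:
        value = options.get(key)
        if value is None:
            problems.append(f"{key} is not set")
        elif ipaddress.ip_address(value) not in ipaddress.ip_network(net):
            problems.append(f"{key}={value} is not in {net}")
    return problems


# Hypothetical VSAN network subnets, one per data site.
SITE1_VSAN_NET = "172.16.10.0/24"
SITE2_VSAN_NET = "172.16.20.0/24"

options = build_ha_advanced_options("172.16.10.1", "172.16.20.1")
print(check_isolation_addresses(options, SITE1_VSAN_NET, SITE2_VSAN_NET))  # []
```

A check like this catches the easy mistake of pointing both isolation addresses at the same site, which would defeat the purpose of having one address per site.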
Here is a summary of all the settings needed when enabling vSphere HA on top of a VSAN stretched cluster.
|Setting|Recommended value|
|vSphere HA|Turned on|
|Host Monitoring|Enabled|
|Host Hardware Monitoring – VM Component Protection: “Protect against Storage Connectivity Loss”|Disabled (left unchecked)|
|Virtual Machine Monitoring|Customer Preference – Disabled by default|
|Admission Control|Define failover capacity by reserving a percentage of cluster resources. Set to 50% for both CPU & Memory.|
|Host Isolation Response|Power off and restart VMs|
|Datastore Heartbeats|“Use datastores only from the specified list”, but do not select any datastores from the list. This disables Datastore Heartbeats|
|das.usedefaultisolationaddress|false|
|das.isolationaddress0|IP address on VSAN network on site 1|
|das.isolationaddress1|IP address on VSAN network on site 2|
The VSAN 6.1 Stretched Cluster Guide, which is now available, will cover these settings in more detail. Please refer to this guide if planning to implement a VSAN stretched cluster.