Supporting Fault Tolerance VMs on vSAN Stretched Cluster
During one of our many discussions at VMworld 2017, I was asked about supporting Fault Tolerance on vSAN Stretched Clusters, more specifically SMP-FT. Now to be clear, we have supported SMP-FT on vSAN since version 6.1. The difficulty with supporting SMP-FT on a vSAN stretched cluster has always been the possible latency between the data sites, which can be as much as 5ms. This is far too high to support SMP-FT on a VM that has data replicating between data sites, and for that reason, we stated categorically that we could not support SMP-FT on VMs deployed on a vSAN stretched cluster. However, new enhancements in version 6.6 of vSAN mean that we can now revisit this support statement.
In vSAN 6.6, a number of new enhancements were made to the vSAN stretched cluster implementation. From a policy perspective, we now have the ability to configure both a PFTT (primary failures to tolerate) across sites and an SFTT (secondary failures to tolerate) within sites. These new policies also allow us to set a PFTT of 0, which means a VM with this policy is not replicated/protected, and resides completely on one site only with no dependency on the other site. Through another policy setting (site affinity), an admin can choose which of the data sites to place the VM on.
This raises an interesting point for SMP-FT. If we use PFTT=0, which basically instantiates a VM on a single site, any VM with this policy setting will not incur the latency from the cross-site link. In other words, all of the objects and components that make up that VM reside on the same site. So this would be just like deploying a VM on a standard vSAN cluster rather than a vSAN stretched cluster. After discussing this with our PM and engineering team, we are now in a position to support SMP-FT on VMs with a PFTT=0 policy in a vSAN stretched cluster.
Just to be clear, SMP-FT is supported on a VM on a vSAN stretched cluster only as long as the VM is “pinned” to a single site using PFTT=0. There is still no support for SMP-FT VMs deployed across a vSAN stretched cluster using a PFTT=1 setting. Of course, you will still need to have your DRS VM-to-Host affinity groups set up to keep the SMP-FT VM compute and memory (for primary and secondary) tied to the same set of hosts within a site. And in our testing, the secondary FT VM’s objects and components also get the same policy as the primary FT VM, e.g. PFTT=0 and site affinity, so that also ensures that storage is tied to the same site as well.
Caution: It should also be noted that there are no guard-rails in place to stop you changing the policy of the SMP-FT VM from a PFTT value of 0 to a PFTT value of 1. However, due to the latency involved when replicating the VM across sites, you may find that SMP-FT will no longer be able to protect the VM. It is up to the admin to ensure that the policy of the SMP-FT VM is not changed from PFTT = 0.
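Since there are no guard-rails, one option is to audit policies yourself. What follows is only a minimal sketch of the support rule described above, written in Python. It does not call any vSAN or vSphere API: the VmPolicy record, its field names, and the VM names are all hypothetical, and I am assuming you have already exported each VM's FT state and PFTT value by whatever means you prefer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VmPolicy:
    """Hypothetical record of one VM's settings (not a real vSAN/vSphere type)."""
    name: str
    ft_enabled: bool  # True if SMP-FT is turned on for this VM
    pftt: int         # primary failures to tolerate, from the VM's storage policy


def smp_ft_supported(vm: VmPolicy) -> bool:
    """Apply the support statement: an SMP-FT VM must use PFTT=0.

    VMs without FT are not affected by the rule, so they always pass.
    """
    if not vm.ft_enabled:
        return True
    return vm.pftt == 0


def unsupported_ft_vms(vms: List[VmPolicy]) -> List[str]:
    """Return the names of SMP-FT VMs whose policy is not PFTT=0."""
    return [vm.name for vm in vms if not smp_ft_supported(vm)]


if __name__ == "__main__":
    # Illustrative inventory only; the names and values are made up.
    inventory = [
        VmPolicy("db01", ft_enabled=True, pftt=0),    # pinned to one site: OK
        VmPolicy("app01", ft_enabled=True, pftt=1),   # stretched with FT: flagged
        VmPolicy("web01", ft_enabled=False, pftt=1),  # no FT: rule does not apply
    ]
    print(unsupported_ft_vms(inventory))  # → ['app01']
```

The check deliberately looks at PFTT only, since that is what the support statement hinges on. Site affinity and the DRS VM-to-Host rules still need to be verified separately.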
We will shortly update the vSAN stretched cluster documentation to reflect this new support statement.
7 Replies to “Supporting Fault Tolerance VMs on vSAN Stretched Cluster”
Hey Cormac. Thanks for this blog. I really appreciate you following this up. I guess I wasn’t the only one who asked you about this at VMworld 🙂
Thanks for your input Roy.
Thanks Cormac. I have another question: do traditional stretched storage solutions such as NetApp MetroCluster or IBM SVC HyperSwap support vSMP-FT?
I’m afraid I don’t know. From KB 2007545, it looks like it is supported on VPLEX, but I guess you would need to ask NetApp or IBM for their respective environments.
Really good news!
Hi Cormac, is the limitation purely on the latency? What if the latency between the data sites is only 1ms when using stretched clusters?
Not supported Johann – latency is certainly one limitation, but there may be others that I am unaware of.