How many hosts are needed to implement SFTT in vSAN Stretched Cluster?

Many of you who are well versed in vSAN will know that we released the Secondary Failures To Tolerate (SFTT) feature with vSAN 6.6. This meant that not only could we tolerate failures across sites, but we could also add another layer of redundancy to the copy of the data maintained at each of the data sites. Of course, the cross-site replication (now referred to as PFTT, or Primary Failures To Tolerate) is still based on RAID-1 mirroring, and this continues to require a third site for the witness appliance so that quorum can be obtained in the case of a full site failure or a cross-site link failure. However, there may still be some confusion around the role of the witness appliance and SFTT. Let’s try to clear that up now with a discussion of SFTT under each of the Fault Tolerance Methods: RAID-6, RAID-5 and RAID-1.

RAID-6 SFTT

To implement a RAID-6 SFTT, one would require 6 hosts at the primary site and 6 hosts at the secondary site, as well as a host at a third site to run the witness appliance. This configuration is referred to as a 6+6+1 configuration. The SFTT Fault Tolerance Method (FTM) must be the same at both of the data sites; we cannot mix FTMs at this point in time. So with a RAID-6 FTM, a configuration might look something like this:

RAID-6 is implemented as a 4+2 on vSAN: 4 data segments and 2 parity segments, all on separate hosts for availability reasons. Here we can see the 6 hosts per site, as well as the distinction between data and parity segments. Such a configuration allows for a complete site failure and up to two local failures at the remaining site.

RAID-5 SFTT

To implement a RAID-5 SFTT, one would require 4 hosts at each data site, as well as a witness at a third site. This is a 4+4+1 configuration. RAID-5 is implemented as a 3+1 on vSAN: 3 data segments and 1 parity segment, each on a separate host. A configuration would look something like this:

This RAID-5 configuration can tolerate 1 failure on the remaining site, even after a full site failure.
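
To make the failure arithmetic for the two erasure-coded FTMs concrete, here is a minimal Python sketch. It is a deliberately simplified model rather than how vSAN itself decides availability: it assumes a full site failure has already happened, and only asks whether the copy at the remaining site can still be read after some of its hosts fail. The function name and the SCHEMES table are hypothetical, and votes and quorum with the witness appliance are ignored.

```python
# (data segments, parity segments) per site, as described in the text.
SCHEMES = {"RAID-5": (3, 1), "RAID-6": (4, 2)}


def local_copy_survives(ftm: str, failed_hosts: int) -> bool:
    """Return True if the remaining site's copy is still readable.

    Simplified assumption: every segment sits on its own host, and the
    copy survives as long as no more segments are lost than there are
    parity segments.
    """
    data, parity = SCHEMES[ftm]
    if not 0 <= failed_hosts <= data + parity:
        raise ValueError("failed_hosts must be between 0 and the host count")
    return failed_hosts <= parity


for ftm in SCHEMES:
    for failures in range(4):
        print(ftm, failures, local_copy_survives(ftm, failures))
```

With these assumptions, RAID-5 survives one further host failure and RAID-6 survives two, which matches the behaviour described for the 4+4+1 and 6+6+1 configurations.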

RAID-1 SFTT

The final SFTT option uses RAID-1. This one can be slightly confusing, as it uses a witness (or witnesses) locally at each data site for SFTT, as well as a witness on the witness appliance to achieve PFTT. Now with RAID-1, we can tolerate 1, 2 or 3 failures. The rule of thumb is that to tolerate “n” failures, you need “2n+1” hosts. In this configuration, I have deployed SFTT to tolerate 1 failure, and set the FTM to RAID-1. Therefore this configuration requires a 3+3+1 setup. I’d expect a layout similar to the following:

So now we have the RAID-1 mirror between sites (made up of 2 copies of the data and a witness), as well as a RAID-1 mirror within each of the sites (again with 2 copies of the data and a witness). The point I want to make is that each data site has its own local witness: the witness appliance on the third site is not used to implement SFTT, only PFTT.

For FTM = RAID-1 to have an SFTT of 2, the configuration would need to be 5+5+1 (remember 2n+1 in this case is 5).

For FTM = RAID-1 to have an SFTT of 3, the configuration would need to be 7+7+1 (again, 2n+1 in this case is 7).
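
The host counts discussed so far can be pulled together in a few lines of Python. This is only an illustrative sketch, and the function names are my own rather than anything from a vSAN API. It encodes the 2n+1 rule for RAID-1 and the fixed host counts for RAID-5 (SFTT of 1) and RAID-6 (SFTT of 2), and assumes a single witness host at the third site.

```python
def hosts_per_site(ftm: str, sftt: int) -> int:
    """Minimum hosts needed at each data site for a given FTM and SFTT."""
    if ftm == "RAID-1":
        if sftt not in (1, 2, 3):
            raise ValueError("RAID-1 supports an SFTT of 1, 2 or 3")
        return 2 * sftt + 1          # the 2n+1 rule of thumb
    if ftm == "RAID-5":
        if sftt != 1:
            raise ValueError("RAID-5 tolerates exactly 1 failure")
        return 4                     # 3 data + 1 parity segments
    if ftm == "RAID-6":
        if sftt != 2:
            raise ValueError("RAID-6 tolerates exactly 2 failures")
        return 6                     # 4 data + 2 parity segments
    raise ValueError(f"unknown FTM: {ftm}")


def layout(ftm: str, sftt: int) -> str:
    """Describe the stretched cluster as 'site A + site B + witness'."""
    n = hosts_per_site(ftm, sftt)
    return f"{n}+{n}+1"


for ftm, sftt in [("RAID-1", 1), ("RAID-1", 2), ("RAID-1", 3),
                  ("RAID-5", 1), ("RAID-6", 2)]:
    print(ftm, "SFTT =", sftt, "->", layout(ftm, sftt))
```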

Bear in mind, though, that an FTM of RAID-1 is going to consume a lot of space, so consider RAID-5 or RAID-6 for their space savings.
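
To put rough numbers on the space trade-off, the sketch below estimates the raw capacity consumed by a single object across both data sites. It is a back-of-the-envelope calculation under stated assumptions, not a sizing tool: RAID-1 keeps n+1 full copies per site to tolerate n failures, RAID-5 (3+1) costs 4/3 of the object size, RAID-6 (4+2) costs 6/4, and the PFTT mirror doubles whatever is consumed at one site. The small witness components are ignored, and the function names are hypothetical.

```python
from fractions import Fraction


def per_site_multiplier(ftm: str, sftt: int = 1) -> Fraction:
    """Raw-to-usable capacity ratio within one data site."""
    if ftm == "RAID-1":
        return Fraction(sftt + 1)    # n+1 full copies to tolerate n failures
    if ftm == "RAID-5":
        return Fraction(4, 3)        # 3 data + 1 parity
    if ftm == "RAID-6":
        return Fraction(6, 4)        # 4 data + 2 parity
    raise ValueError(f"unknown FTM: {ftm}")


def raw_capacity_gb(size_gb: int, ftm: str, sftt: int = 1) -> float:
    """Raw capacity used across both data sites (PFTT mirror = x2)."""
    return float(size_gb * per_site_multiplier(ftm, sftt) * 2)


for ftm, sftt in [("RAID-1", 1), ("RAID-5", 1), ("RAID-6", 2), ("RAID-1", 2)]:
    print(f"{ftm} SFTT={sftt}: {raw_capacity_gb(100, ftm, sftt):.2f} GB raw")
```

For a 100 GB object under these assumptions, RAID-1 with an SFTT of 1 consumes 400 GB of raw capacity, while RAID-5 gives the same local protection for roughly 267 GB. RAID-6 tolerates two local failures for 300 GB, where RAID-1 with an SFTT of 2 would need 600 GB.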


Published by Cormac