With that in mind, let’s now look at this in the context of vSAN stretched cluster.
Let’s take the example of a 3+3+1 vSAN stretched cluster configuration. This would entail 3 nodes at the preferred site, 3 nodes at the secondary site, and then the witness node at a third site. The network is configured using L3 routing between the data sites, and to the witness. Currently, with vSAN stretched clusters, virtual machines are deployed with a policy of FTT=1, as discussed above. This means that the first copy of the data gets placed on the preferred site, the second copy of the data on the secondary site, and the witness component gets placed on the third/witness site. So now let’s discuss what happens when we introduce multiple failures.
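To make that layout a little more concrete, here is a minimal sketch in Python of how an FTT=1 object in this 3+3+1 configuration is spread across the three sites, with each component carrying one vote. This is purely illustrative; the names (`Component`, `PREFERRED`, and so on) are my own and not part of any vSAN API.

```python
from dataclasses import dataclass

# Illustrative model only - these names are not part of the vSAN API.
PREFERRED, SECONDARY, WITNESS_SITE = "preferred", "secondary", "witness"

@dataclass
class Component:
    kind: str   # "data" or "witness"
    site: str   # which site/fault domain holds it
    votes: int  # each component contributes one vote in this simple model

# An FTT=1 object in a stretched cluster: one data copy per data site,
# plus a witness component at the third site to break ties.
ftt1_object = [
    Component("data", PREFERRED, 1),
    Component("data", SECONDARY, 1),
    Component("witness", WITNESS_SITE, 1),
]
```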
Failure 1 – break the vSAN network between the data sites.
When we have a split-brain situation like this in a vSAN stretched cluster, the witness will form a cluster with the preferred site. This is reflected in many places, including the health check. This is what such a situation might look like:
What we can see here is that the witness host is now in a partition with 3 of the nodes (h, g & i), which are the hosts in the preferred site. You would also typically see vSphere HA failing all VMs over to the site which has quorum, i.e. the preferred site. This is all as expected. Let’s now introduce a second failure.
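The reason the preferred site keeps running is the simple majority rule: an object is only accessible in a partition that holds more than half of its votes, and on a data-site split the witness sides with the site designated as preferred. Here is a rough sketch of that decision, continuing the hypothetical model above (again, not actual vSAN code):

```python
# Simplified majority check - a partition can serve an object only if it
# holds more than half of that object's votes. Illustrative model only.
TOTAL_VOTES = 3  # two data components + one witness component (FTT=1)

def has_quorum(votes_in_partition: int, total_votes: int = TOTAL_VOTES) -> bool:
    return votes_in_partition > total_votes / 2

# Failure 1: inter-site link down, witness still reachable from both sites.
# By design the witness joins the *preferred* fault domain, so:
preferred_partition_votes = 2   # preferred data copy + witness component
secondary_partition_votes = 1   # secondary data copy only

print(has_quorum(preferred_partition_votes))  # True  -> objects accessible
print(has_quorum(secondary_partition_votes))  # False -> objects inaccessible
```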
Failure 2 – break the network between the witness site and the data sites.
In this next scenario, I am going to remove the network routing from the witness site to both the preferred and secondary sites. This will isolate the witness host from all the other hosts in the cluster. The cluster is now partitioned into three: the hosts in the preferred data site are in one partition, the hosts in the secondary data site are in another partition, and the witness is in a third.
We now end up in a situation that looks something like this:
I am now also in a situation where all of my virtual machines are inaccessible, since a quorum (i.e. a majority of components) cannot be reached by any partition. This is what they look like in the vSphere web client when examined:
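Applying the same majority rule to the three-way partition explains why everything is now inaccessible: each partition holds exactly one of the object’s three votes, so none of them reaches a majority. A short sketch, using the same hypothetical model as above:

```python
# After failure 2: three partitions, each holding one vote of the FTT=1 object.
partitions = {
    "preferred site": 1,   # data copy at the preferred site
    "secondary site": 1,   # data copy at the secondary site
    "witness site":   1,   # witness component
}
total_votes = sum(partitions.values())

for name, votes in partitions.items():
    has_majority = votes > total_votes / 2
    print(f"{name}: {votes}/{total_votes} votes -> accessible={has_majority}")
# No partition exceeds 50% of the votes, so the object is inaccessible everywhere.
```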
Recovery 1 – restoring witness network to the secondary site.
Now I will do my first recovery step. Since the witness and the preferred site were running together for a period of time after the first failure, their configuration sequence numbers (CSNs) will be higher than those of the secondary site. Restoring the connection from the secondary site to the witness will form a cluster of the secondary hosts and the witness, as shown here.
Now we can clearly see that the witness and the secondary nodes have formed a cluster. However, the issue is that the VMs remain inaccessible. This is because of the configuration sequence number mismatch discussed previously. The reason we do not make the VMs accessible at this point is that there is a good chance the contents of the component on the preferred site are more up to date than the contents of the component on the secondary site. If we brought the object online now, it would expose old, out-of-date data.
Recovery 2 – restoring witness network to the preferred site.
Next, I put back the networking between the preferred site and the witness. The hosts in the preferred site now form a cluster with the witness. These have the same CSNs, so the virtual machines are now accessible.
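Pulling the two recovery steps together: quorum alone is not enough, the partition must also contain an up-to-date copy of the data. The sketch below models that extra check with a per-component configuration sequence number. This is a simplified illustration of the behaviour described above, not the actual vSAN algorithm, and the CSN values are made up for the example.

```python
# Simplified model: a partition can expose the object only if it has quorum
# AND it holds a data component that is as new as the highest CSN it knows about.
def object_accessible(components_in_partition, total_votes=3):
    votes = sum(c["votes"] for c in components_in_partition)
    if votes <= total_votes / 2:
        return False  # no quorum
    highest_csn = max(c["csn"] for c in components_in_partition)
    # At least one *data* component must be at the highest known CSN,
    # otherwise we would be serving stale data.
    return any(c["kind"] == "data" and c["csn"] == highest_csn
               for c in components_in_partition)

# Recovery 1: secondary data copy (stale CSN) + witness (newer CSN).
print(object_accessible([
    {"kind": "data",    "votes": 1, "csn": 100},  # secondary copy, out of date
    {"kind": "witness", "votes": 1, "csn": 105},  # witness advanced with preferred
]))  # False - quorum, but the only data copy present is stale

# Recovery 2: preferred data copy + witness, matching CSNs.
print(object_accessible([
    {"kind": "data",    "votes": 1, "csn": 105},
    {"kind": "witness", "votes": 1, "csn": 105},
]))  # True - quorum and an up-to-date data copy
```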
And here is my list of virtual machines:
Now, as my final step, I can reinstate the connection between the hosts in the preferred site and the secondary site. This puts all of my hosts back into one partition:
Hopefully this post has provided a good explanation of the nature of double failures in a vSAN stretched cluster, and some of the considerations one may have when it comes to recovery.