Understanding recovery from multiple failures in a vSAN stretched cluster

Some time back I wrote an article that described what happens when an object deployed on a vSAN datastore has a policy of Number of Failures to Tolerate set to 1 (FTT=1) and multiple failures are introduced. For simplicity, let's label the three components that make up our FTT=1 object as A, B and W. A and B are data components and W is the witness component. Let's now assume that we lose access to component A. Components B & W are still available, so the object (e.g. a VMDK) is still available. The state of these two components (B & W) changes, meaning that they are no longer in sync with component A. Component A is now considered “stale”. This tracking of state is done in vSAN via configuration sequence numbers, or CSNs for short.

Now there is a second failure, which means component B is no longer accessible from component W. This renders the whole object inaccessible, since it was only set up to tolerate 1 failure and it now has multiple failures. After a period of time, let's say that component A recovers and can talk to witness W. We now have component A with an older CSN (1234) and component W with a newer CSN (5678); the numbers are chosen for illustration purposes only. As highlighted in the previous post, component A is considered STALE. This means that the object represented by these components cannot become accessible, because the data on A may be older/out-of-date. In this case, we would need component B to recover. Component B (which holds the latest data) and component W would have the same configuration sequence number. These two form a quorum and make the actual object accessible. Component A would then be able to synchronize from B to get the latest data.
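To make this concrete, here is a minimal sketch in Python of the rule being applied. This is not vSAN code; the component names, CSN values and the helper function are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str        # "A", "B" (data components) or "W" (witness)
    csn: int         # configuration sequence number
    reachable: bool  # can the rest of the cluster currently see it?

def object_accessible(components: list[Component]) -> bool:
    """An object is accessible only if a majority of its components is
    reachable AND that majority carries the latest CSN (i.e. is not stale)."""
    latest = max(c.csn for c in components)
    up_to_date = [c for c in components if c.reachable and c.csn == latest]
    return len(up_to_date) > len(components) / 2

# A has recovered but carries an old CSN; B is still down; W is up to date.
A = Component("A", csn=1234, reachable=True)
B = Component("B", csn=5678, reachable=False)
W = Component("W", csn=5678, reachable=True)
print(object_accessible([A, B, W]))   # False: A is stale, so no valid quorum

# Once B (which holds the latest data) recovers, B and W form a quorum.
B.reachable = True
print(object_accessible([A, B, W]))   # True: A can now resync from B
```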

With that in mind, let’s now look at this in the context of vSAN stretched cluster.

Let's take the example of a 3+3+1 vSAN stretched cluster configuration. This entails 3 nodes at the preferred site, 3 nodes at the secondary site, and the witness node at a third site. The network is configured using L3 routing between the data sites, and to the witness. Currently, with vSAN stretched clusters, virtual machines are deployed with a policy of FTT=1, as discussed above. This means that the first copy of the data gets placed on the preferred site, the second copy of the data on the secondary site, and the witness component gets placed on the third/witness site. So now let's discuss what happens when we introduce multiple failures.
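Before introducing any failures, here is a simple sketch of where the components of an FTT=1 object land in this 3+3+1 configuration. The host names are invented for illustration; the key point is that each site acts as a fault domain, so one data component goes to each data site and the witness component goes to the witness site:

```python
# Illustrative placement only -- host names are invented for this example.
sites = {
    "preferred": ["esx-g", "esx-h", "esx-i"],   # 3 nodes
    "secondary": ["esx-j", "esx-k", "esx-l"],   # 3 nodes
    "witness":   ["witness-appliance"],         # 1 witness node
}

# An FTT=1 object (e.g. a VMDK) gets one component per fault domain/site.
vmdk_components = {
    "data-copy-1": "preferred",
    "data-copy-2": "secondary",
    "witness":     "witness",
}
```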

Failure 1 – break the vSAN network between the data sites.

When we have a split-brain situation like this in a vSAN stretched cluster, the witness will form a cluster with the preferred site. This is reflected in many places, including the health check. This is what such a situation might look like:

What we can see here is that the witness host is now in a partition with 3 of the nodes (h, g & i), which are the hosts in the preferred site. You would also typically see vSphere HA failing all VMs over to the site which has quorum, i.e. the preferred site. This is all as expected. Let's now introduce a second failure.
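Before we do, it is worth making the quorum arithmetic of this first failure concrete. Here is a minimal sketch (again, not vSAN code; the site names match the placement sketch earlier) of which partition can access the object:

```python
# With the inter-site link down, which partition holds a majority (2 of 3)
# of the object's components?
placement = {"data-copy-1": "preferred", "data-copy-2": "secondary", "witness": "witness"}

def has_quorum(partition_sites: set[str]) -> bool:
    held = sum(1 for site in placement.values() if site in partition_sites)
    return held > len(placement) / 2

print(has_quorum({"preferred", "witness"}))  # True  -> object accessible on this side
print(has_quorum({"secondary"}))             # False -> only 1 of 3 components
```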

Failure 2 – break the network between the witness site and the data sites.

In this next scenario, I am going to remove the network routing from the witness site to both the preferred and secondary sites. This will isolate the witness host from all the other hosts in the cluster. The cluster is now partitioned into three: the hosts in the preferred data site are in one partition, the hosts in the secondary data site are in another, and the witness is in a third.
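In terms of the same quorum arithmetic, each of the three partitions now holds only one of the object's three components, as this short, purely illustrative sketch shows:

```python
# Three-way partition: no partition holds 2 of the 3 components.
placement = {"data-copy-1": "preferred", "data-copy-2": "secondary", "witness": "witness"}

for partition in [{"preferred"}, {"secondary"}, {"witness"}]:
    held = sum(1 for site in placement.values() if site in partition)
    print(sorted(partition), "holds", held, "of 3 components:",
          "quorum" if held > 1 else "no quorum")
```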

We now end up in a situation that looks something like this:

I am now also in a situation where all of my virtual machines are inaccessible, since a quorum (i.e. a majority of components) cannot be reached. This is what they look like in the vSphere web client when examined:

Recovery 1 – restoring witness network to the secondary site.

Now I will do my first recovery step. Since the witness and the preferred site were running together for a period of time, their configuration sequence numbers (CSNs) will be higher than those of the secondary site. Restoring the connection from the secondary site to the witness will form a cluster of the secondary hosts and the witness, as shown here.

Now we can clearly see that the witness and the secondary nodes have formed a cluster. However, the issue is that the VMs remain inaccessible. This is because of the configuration sequence number mismatch discussed previously. The reason we do not make the VMs accessible at this point is that there is a good chance that the data in the component on the preferred site is more up to date than the data on the secondary site. If we brought the object online now, it would present old, out-of-date data.
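Here is a sketch of why this combination is not enough, reusing the accessibility rule from the earlier example (the CSN values are invented for illustration): the witness and the secondary data component can see each other, but the secondary copy is stale, so the pair does not form a valid, up-to-date quorum:

```python
# Witness + secondary are connected, but the secondary data copy is stale.
components = {
    "data-preferred": {"csn": 5678, "reachable": False},  # cut off from the witness
    "data-secondary": {"csn": 1234, "reachable": True},   # stale copy
    "witness":        {"csn": 5678, "reachable": True},
}

latest = max(c["csn"] for c in components.values())
up_to_date = [n for n, c in components.items() if c["reachable"] and c["csn"] == latest]
print(up_to_date)                              # ['witness'] -- only 1 of 3
print(len(up_to_date) > len(components) / 2)   # False -> object stays inaccessible
```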

Recovery 2 – restoring witness network to the preferred site.

Next, I put back the networking between the preferred site and the witness. The hosts in the preferred site now form a cluster with the witness. These have the same CSNs, so the virtual machines are now accessible.
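The same sketch with the preferred site reachable again shows why: the preferred data component and the witness share the latest CSN, so a valid quorum exists and the object comes back online, while the secondary copy remains stale for now (values are still illustrative only):

```python
# Preferred data copy + witness share the latest CSN -> valid quorum.
components = {
    "data-preferred": {"csn": 5678, "reachable": True},
    "data-secondary": {"csn": 1234, "reachable": False},  # still partitioned away, and stale
    "witness":        {"csn": 5678, "reachable": True},
}

latest = max(c["csn"] for c in components.values())
up_to_date = [n for n, c in components.items() if c["reachable"] and c["csn"] == latest]
print(up_to_date)                              # ['data-preferred', 'witness']
print(len(up_to_date) > len(components) / 2)   # True -> object accessible again
```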

And here is my list of virtual machines:

Now, as my final step, I can reinstate the connection between the hosts in the preferred site and the secondary site. This puts all of my hosts back into one partition, and the stale components on the secondary site can now resynchronize from the up-to-date copies on the preferred site:

Hopefully this post has provided a good explanation of the nature of double failures in a vSAN stretched cluster, and of some of the considerations that come into play during recovery.
