
Another recovery from multiple failures in a vSAN stretched cluster

In a previous post on multiple failures in a vSAN stretched cluster, we showed that if a failure leaves the data components out of sync, the most recent copy of the data needs to recover before the object becomes accessible again. This is true even if a majority of components are available (e.g. the old data copy and the witness). It ensures that we do not recover the “STALE” copy of the data, which might contain out-of-date information. To briefly revisit the previous post, the accessibility of an object that suffers multiple failures with FTT=1 (a RAID-1 mirror) can be described as follows (a simple sketch of the underlying checks follows the list):

  1. 1st failure – 1st data copy unavailable, but witness+2nd copy available – object accessible
  2. 2nd failure occurs – 2nd data copy goes unavailable – object inaccessible
  3. 1st data copy and witness recover – object remains inaccessible
  4. Both data copies and witness recover – object accessible
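
Before looking at why step 3 fails, here is a minimal Python sketch of the two checks vSAN is effectively applying here: a quorum check (a majority of component votes must be reachable) and a staleness check (a reachable data copy must be at least as recent, by CSN, as anything else reachable). This is purely illustrative, it is not vSAN's actual implementation, and the class, helper, and CSN handling are simplified assumptions:

```python
from dataclasses import dataclass

@dataclass
class Component:
    kind: str        # "data" or "witness"
    csn: int         # configuration sequence number (simplified)
    available: bool  # is this component reachable/healthy?
    votes: int = 1

def is_accessible(components, total_votes=3):
    """Simplified accessibility check for one FTT=1 object.

    Rule 1 (quorum): a majority of all votes must be reachable.
    Rule 2 (staleness): a reachable data copy must be at least as
    recent (by CSN) as every other reachable component.
    """
    live = [c for c in components if c.available]
    if sum(c.votes for c in live) * 2 <= total_votes:
        return False                        # no quorum
    data_csns = [c.csn for c in live if c.kind == "data"]
    if not data_csns:
        return False                        # no full data copy reachable
    return max(data_csns) >= max(c.csn for c in live)   # data must not be stale
```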

We cannot make the object accessible at step 3 because the 1st data component is considered “stale”. vSAN knows this by comparing the component's configuration sequence number (CSN) to the CSN of the witness. This implies that the data component could have out-of-date data, so vSAN does not make it accessible (there is more detail on the CSN in the previous post). Now I am going to look at the same scenario, but introduce the failures in a different order:

  1. 1st failure – witness unavailable, but 1st+2nd data copy available – object accessible
  2. 2nd failure occurs – split-brain between 1st and 2nd copy – object inaccessible
  3. 1st data copy and witness recover – object becomes accessible
  4. Both data copies and witness recover – object accessible

The interesting point here is that even though the same components are recovered at step 3 as in the previous scenario, this time the object can be made accessible. This is because, even though the witness is older/stale compared to the data copy, there is no actual data on the witness. Therefore vSAN can make the object accessible in the knowledge that we have the latest data.
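
Using the Component class and is_accessible() helper from the sketch above, the two failure orderings compare at step 3 as follows (CSN values are for illustration only):

```python
# Ordering 1: one data copy fails first, then the second data copy.
data1   = Component("data",    csn=1234, available=False)  # 1st failure
data2   = Component("data",    csn=5678, available=False)  # 2nd failure
witness = Component("witness", csn=5678, available=True)

data1.available = True                           # step 3: data1 + witness recover
print(is_accessible([data1, data2, witness]))    # False: data1 is stale (1234 < 5678)

# Ordering 2: the witness fails first, then the data sites split-brain.
data1   = Component("data",    csn=5678, available=True)   # this partition's copy
data2   = Component("data",    csn=5678, available=False)  # other side of the split
witness = Component("witness", csn=1234, available=True)   # step 3: witness recovers

print(is_accessible([data1, data2, witness]))    # True: data1 holds the latest data
```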

Again, let’s discuss this in the context of a vSAN stretched cluster. As before, we have a 3+3+1 vSAN stretched cluster configuration: 3 nodes at the preferred site, 3 nodes at the secondary site, and the witness node at a third site. The network is configured using L3 routing between the data sites, and to the witness. As described in the previous post, with vSAN stretched clusters, virtual machines are deployed with a policy of FTT=1. This means that the first copy of the data gets placed on the preferred site, the second copy of the data on the secondary site, and the witness component gets placed on the third/witness site.
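
In terms of the simplified model above, the component layout for one such object might be represented like this (the site labels are placeholders, one vote per component):

```python
# One FTT=1 object in a 3+3+1 stretched cluster: a data copy per data
# site, plus a witness component at the third site.
layout = {
    "preferred site": Component("data",    csn=1, available=True),
    "secondary site": Component("data",    csn=1, available=True),
    "witness site":   Component("witness", csn=1, available=True),
}
```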

Failure 1 – remove access to the witness from both data sites.

This time around, I will remove access to the witness from both of the data sites in the vSAN stretched cluster. As expected, the data nodes form a cluster and the witness host/appliance is isolated, as can easily be observed in the vSAN health checks.

So in this case, the witness component will have an older CSN (e.g. 1234) and the data components will have a newer CSN (e.g. 5678) – numbers chosen for illustration purposes only.
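
Continuing the sketch, failure 1 could be modelled as follows (the CSN handling is, again, a simplification of what vSAN actually does):

```python
# Failure 1: the witness is isolated from both data sites. The object's
# configuration changes, so the surviving components record a newer CSN,
# while the unreachable witness keeps its old one.
witness = layout["witness site"]
witness.available = False
witness.csn = 1234                       # frozen at the old value
layout["preferred site"].csn = 5678      # bumped by the configuration change
layout["secondary site"].csn = 5678
```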

Failure 2 – split-brain between preferred and secondary data sites.

I will now introduce a second failure: a split-brain between the two data sites. Since this is my second failure, I would expect to see 3 partitions – the preferred site hosts in one partition, the secondary site hosts in another, and the witness in a third.

And of course, with this double failure, my virtual machines become inaccessible because no partition has quorum.
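
In the model, each partition now reaches only one of the three components, so no partition can assemble a quorum:

```python
# Failure 2: split-brain between the data sites, three partitions in total.
preferred_view = [layout["preferred site"]]   # 1 of 3 votes
secondary_view = [layout["secondary site"]]   # 1 of 3 votes
print(is_accessible(preferred_view))          # False: no quorum
print(is_accessible(secondary_view))          # False: no quorum
```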

Now let's recover the witness. In the previous scenario, after a double failure, when I had the witness and the older data copy recovered, vSAN could not make the object accessible because the configuration sequence number (CSN) on the witness was newer than the CSN on the data component. This essentially meant that the object (e.g. a VMDK) was “updated” after that data component went absent, an indication that the data on the data component is considered “stale”.

However, when we recover the witness in this current scenario, the CSN on the data component is newer than the CSN on the witness, which means we have the latest data on that component, so we can make the object accessible (effectively bumping up the CSN on the witness).

Recovery 1 – restore the witness to both the data sites

After recovering the witness, I see that it forms a cluster with the 3 hosts on my preferred vSAN stretched cluster site, hosts g, h & i. This is because we still have a split-brain situation where the hosts on the preferred site (g, h, i) and the hosts on the secondary site (n, o, p) can no longer communicate. In this scenario, the witness forms a cluster with the preferred site, as shown here.

But now the major difference from the previous scenario is that even though the CSNs on the witness and the hosts differ, the VMs can be made accessible because the CSN on the data copy is newer than the CSN on the witness. So soon after restoring the witness, the VM becomes accessible.
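
Back in the model, the recovery looks like this: quorum is restored, the data copy wins the CSN comparison, and the witness CSN is then brought up to match:

```python
# Recovery 1: the witness rejoins and partitions with the preferred site.
witness.available = True
preferred_view = [layout["preferred site"], witness]  # 2 of 3 votes: quorum
print(is_accessible(preferred_view))  # True: data CSN (5678) >= witness CSN (1234)
witness.csn = layout["preferred site"].csn            # witness CSN bumped up
```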

Again, I hope this is a useful explanation of the way things work in environments that experience multiple failures, even though the VMs (objects) are only set up to tolerate a single failure. vSAN has guard-rails in place to ensure the older/stale data is not recovered ahead of the latest, most up-to-date data.

[Update] I got an interesting question today from one of my colleagues. What if the witness came back and got paired up with a data copy that has a CSN that is more recent than the witness's, but lower than the other data copy's? If this were possible, would we end up losing data? Well, this situation is not possible. If the witness is partitioned first, and the data components partition after that, then both data components will have the same CSN. Neither of the data components can be updated, as neither of them has quorum; both were partitioned at the same time with the same CSN. So it really doesn't matter which one recovers first and joins the witness, the data will always be the latest copy. Good question though.
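
In terms of the sketch, both copies froze at the same CSN at the moment of the split-brain, and without quorum neither side could take further updates:

```python
# Both data copies froze at the same CSN when the split-brain occurred,
# so whichever copy pairs up with the witness first holds the latest data.
assert layout["preferred site"].csn == layout["secondary site"].csn
```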
