Another recovery from multiple failures in a vSAN stretched cluster
In a previous post on multiple failures in a vSAN stretched cluster, we showed that if a failure causes the data components to go out of sync, the most recent copy of the data must recover before the object becomes accessible again. This is true even if a majority of the object's components are available (e.g. the old data copy and the witness). This ensures that we do not recover the “STALE” copy of the data, which might hold out-of-date information. To briefly revisit the previous post, the accessibility of the object when there are multiple failures with FTT=1 (RAID-1 mirror) can be described as follows:
1. First failure – 1st data copy unavailable, but witness+2nd copy available – object accessible
2. Second failure occurs – 2nd data copy goes unavailable – object inaccessible
3. 1st data copy and witness recover – object remains inaccessible
4. Both data copies and witness recover – object accessible
We cannot make the object accessible at step 3 because the 1st data component is considered “stale”. vSAN knows this because the component's configuration sequence number (CSN) is older than the CSN of the witness. This implies that the data component could hold out-of-date data, so vSAN does not make it accessible (there is more detail on the CSN in the previous post). Now I am going to look at the same scenario, but introduce the failures in a different order:
1. First failure – witness unavailable, but 1st+2nd data copy available – object accessible
2. Second failure occurs – split-brain between 1st and 2nd copy – object inaccessible
3. 1st data copy and witness recover – object becomes accessible
4. Both data copies and witness recover – object accessible
The interesting point here is that even though the same components are recovered at step 3 when compared to the previous scenario, this time the object can be made accessible. This is because even though the witness is older/stale when compared to the copy of the data, there is no actual data on the witness. Therefore vSAN can make the object accessible in the knowledge that we have the latest data.
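The difference between the two scenarios can be sketched in a few lines of Python. This is only an illustrative model, not how vSAN is implemented: the `Component` class, the `object_accessible` function and the one-vote-per-component rule are assumptions made for the example, and the CSN values are the same made-up numbers used elsewhere in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    holds_data: bool  # False for the witness, which stores no data
    csn: int          # configuration sequence number last seen by this component


def object_accessible(available, total_votes=3):
    """Simplified model (assumption): one vote per component."""
    # Rule 1: a strict majority of the components must be available (quorum).
    if len(available) * 2 <= total_votes:
        return False
    # Rule 2: an available data copy must carry the newest CSN seen among
    # the available components, otherwise that data may be stale.
    newest = max(c.csn for c in available)
    return any(c.holds_data and c.csn == newest for c in available)


# Previous post: the recovered data copy is older than the witness.
previous_scenario = [Component("data1", True, 1234), Component("witness", False, 5678)]
# This post: the witness is older than the recovered data copy.
current_scenario = [Component("data1", True, 5678), Component("witness", False, 1234)]

print(object_accessible(previous_scenario))  # False
print(object_accessible(current_scenario))   # True
```

In both cases the same two components are back and quorum is met; only the direction of the CSN difference changes the outcome.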
Again, let’s discuss this in the context of a vSAN stretched cluster. As before, we have a 3+3+1 vSAN stretched cluster configuration. This has 3 nodes at the preferred site, 3 nodes at the secondary site and the witness node at a third site. The network is configured using L3 routing between the data sites, and to the witness. As described in the previous post, with vSAN stretched clusters, virtual machines are deployed with a policy of FTT=1. This means that the first copy of the data gets placed on the preferred site, the second copy of the data on the secondary site, and the witness component gets placed on the third/witness site.
Failure 1 – remove access to the witness from both data sites.
This time around, I will remove access to the witness from both of the data sites in the vSAN stretched cluster. As expected, the data nodes form a cluster and the witness host/appliance is isolated, which is easy to observe in the vSAN health checks.
So in this case, the witness component will have an older CSN (e.g. 1234) and the data components will have a newer CSN (e.g. 5678) – numbers chosen for illustration purposes only.
Failure 2 – split-brain between preferred and secondary data sites.
I will now introduce a second failure: a split-brain between the two data sites. With both failures in place, I would expect to see 3 partitions – preferred site hosts are in 1 partition, secondary site hosts are in another, and the witness is in a third.
And of course, with this double failure, my virtual machines become inaccessible because no partition has quorum.
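As a rough illustration of why no partition has quorum here, the following sketch counts votes per partition. The function name `partitions_with_quorum` and the assumption of exactly one vote per component are mine, chosen to keep the example simple; it is not a description of vSAN's real vote assignment.

```python
def partitions_with_quorum(votes_by_partition, total_votes):
    """Return the names of partitions holding a strict majority of the votes."""
    return [
        name
        for name, votes in votes_by_partition.items()
        if votes * 2 > total_votes
    ]


# After the second failure: 1st data copy, 2nd data copy and witness
# each sit alone in their own partition (one vote each, by assumption).
after_second_failure = {"preferred": 1, "secondary": 1, "witness": 1}

# After the witness is restored and joins the preferred site.
after_witness_recovery = {"preferred+witness": 2, "secondary": 1}

print(partitions_with_quorum(after_second_failure, 3))    # []
print(partitions_with_quorum(after_witness_recovery, 3))  # ['preferred+witness']
```

With three single-vote partitions nobody reaches 2 of 3 votes; once the witness rejoins one data site, that partition does.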
Now let's recover the witness. In the previous scenario, after a double failure, when I had the witness and the older data copy recovered, vSAN could not make the object accessible because the configuration sequence number (CSN) on the witness was newer than the CSN on the data component. This essentially meant that the object (e.g. a VMDK) was “updated” after that data component went absent, an indication that the data on the data component is “stale”.
However, when we recover the witness in this current scenario, the CSN on the data component is newer than the CSN on the witness, which means we have the latest data on that component, and so we can make the object accessible (effectively bumping up the CSN on the witness).
Recovery 1 – restore the witness to both the data sites
After recovering the witness, I see that it forms a cluster with the 3 hosts on my preferred vSAN stretched cluster site, hosts g, h & i. This is because we still have a split-brain situation where the hosts on the preferred site (g,h,i) and the hosts on the secondary site (n,o,p) cannot communicate, and in that situation the witness forms a cluster with the preferred site.
But now the major difference from the previous scenario is that even though the CSNs on the witness and the hosts are different, the VMs can be made accessible because the CSN on the data copy is newer than the CSN on the witness. So soon after restoring the witness, the VM becomes accessible.
Again, I hope this is a useful explanation of the way things work in environments that experience multiple failures, even though the VMs (objects) are only set up to tolerate a single failure. vSAN has guard-rails in place to ensure that older/stale data is not recovered ahead of the latest, most up-to-date data.
[Update] I got an interesting question today from one of my colleagues. What if the witness came back and got paired up with a data copy that has a CSN that is more recent than the witness, but lower than the other data copy? If this were possible, would we end up losing data? Well, this situation is not possible. If the witness is already partitioned, and the data components partition after that, then they will both have the same CSN. Neither of the data components can be updated as neither of them has quorum, so both are partitioned at the same time with the same CSN. So really it doesn’t matter which one recovers first and joins the witness; the data will always be the latest copy. Good question though.
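The argument in the update can be walked through with a small simulation. It is a sketch under stated assumptions only: `bump_csn` is a hypothetical helper, each component has one vote, and the CSN is modelled as a counter that may advance only inside a partition holding a majority.

```python
def bump_csn(csn_by_component, partition, total_votes=3):
    """Advance the CSN for every member of `partition`, but only if the
    partition holds a strict majority of the votes. Returns True if bumped."""
    if len(partition) * 2 <= total_votes:
        return False
    new_csn = max(csn_by_component[m] for m in partition) + 1
    for member in partition:
        csn_by_component[member] = new_csn
    return True


csn = {"data1": 1, "data2": 1, "witness": 1}

# Failure 1: the witness is isolated. The two data copies still hold
# 2 of 3 votes, so the configuration (and the CSN) can move forward.
bump_csn(csn, {"data1", "data2"})

# Failure 2: the data sites split. No partition has a majority,
# so none of these calls changes anything.
bump_csn(csn, {"data1"})
bump_csn(csn, {"data2"})
bump_csn(csn, {"witness"})

print(csn["data1"] == csn["data2"])   # True
print(csn["data1"] > csn["witness"])  # True
```

Whichever data copy rejoins the witness first, it carries the same CSN as the other copy, and that CSN is newer than the witness's.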
Hi Cormac,
Great stuff as always, but …
1. Is there the ability to manually promote a “stale” copy of the data (i.e. you temporarily lose site A and then permanently lose B)?
I think most customers would expect the system to still be accessible in this scenario, as you do have a valid, but “old” copy of the data.
2. Are there plans to make the witness passive (i.e. the witness is only used to determine if a split brain occurs like VPLEX)?
3. Are there plans to make the preferred site configurable at the VM level so in the event of a split brain VMs can continue to run at both sites?
Points 2 and 3 would bring vSAN more in line with VPLEX and would increase the availability of the solution.
I know we have discussed most of these in the past, but I do think that a stretched cluster needs to provide the highest levels of availability possible under all scenarios.
Many thanks
Mark
Hi Mark,
Re 1, yes there is. Our support guys have tooling that can be used to do just that.
Re 2, the vSAN witness does not hold any data. It only holds the quorum/vote when there is a failure. This is an essential component in how vSAN is architected.
Re 3, this has been a regular ask from our customers. It is something that is being evaluated right now. I can’t say much more than that.
Hi Cormac,
Regarding Re 1, we actually faced that scenario in our environment (host A went purple screen, then some time later host B had a faulty disk; we recovered host A, but now the host A disk is showing as STALE and we can't recover the host B disk).
We worked through it with VMware for about 10 hours and were told it is unrecoverable. Any chance the tooling you mentioned is not known by VMware support? Is there anything that we can do to recover? For info, the VMDK we lost was a big one (4TB).
We’ve contacted the escalation team on your behalf. Hopefully someone will be following up with you soon.
Thanks Cormac! I appreciate it!
Thanks Cormac – that’s good news.
Hi Cormac,
Just another thought on this: for the scenario where you lose the witness and one site (or one node in a 2-node ROBO configuration), is there, or will there be, a tool which enables us to manually enforce quorum so the user can get access to their data?
Again I think most customers would not be happy if they had a full copy of their data available but it could not be accessed because a remote witness has gone down.
I assume from what you had said that the tool to promote a “stale” copy of data will only be available from support – are there plans to put this into the UI so the end-user can run it themselves?
Any thoughts would be appreciated.
Many thanks
Mark
Hi Cormac,
I have been working on a project whereby the customer has two buildings which are ~750 metres apart and they want to create a single cluster with 2 or 3 nodes per building.
Now at first sight this would be a classic stretched cluster scenario and would require the Enterprise licence, but why not save money and use the Standard licence and Fault Domains?
At the end of the day a Stretched Cluster is just two Fault Domains with the following additions:
1. The witness host
2. Read locality
3. Preferred Site
Now if the customer is prepared to accept the fact that if the link between the sites goes down they will have a split brain then why not just use the Standard licence?
Looking back through some of your previous blogs I see that there is a loose definition for each technology as follows:
1. Fault Domains – between racks
2. Stretched Clusters – between sites
Now the problem with this is there is no mention of distance between racks, I would assume that Fault Domains must:
1. Be connected to the same ToR switches
2. Be within the same DC/building/computer room
3. Be no more than 100 metres apart (or less as defined by VMware)
Is there an official statement from VMware with regard to a tight definition of the above?
My concern is that this is currently open to abuse, but in the future VMware will tighten up the definition and some customers who have deployed Fault Domains between racks that are in different buildings and 100s of metres apart will then have an un-supported configuration.
Many thanks as always
Mark
Hi Mark, you can create a 2-node stretched cluster + witness with the Standard license. Duncan described the scenario here – http://www.yellow-bricks.com/2017/01/24/two-host-stretched-vsan-cluster-with-standard-license/. However, if you try to add any more nodes to the stretched cluster, it will fail. More than 2 nodes requires an Enterprise license. If you try to do this with Fault Domains only, then yes, as you point out, a split brain will leave you with inaccessible VMs. We do not mention distances, but we do mention latency and bandwidth requirements for stretched clusters. One can assume that the same requirements apply between fault domains.
Hi Cormac,
So are you saying that with Fault Domains there are no requirements with regard to distance or to the hosts being connected to the same ToR switches (beyond the bandwidth and latency requirements of a stretched cluster)?
In other words if the customer was happy to sacrifice the following they could build a pseudo stretched cluster with large number of nodes with just the Standard licence:
1. The witness host
2. Read locality
3. Preferred Site
I assume that when the next vSAN release is made available they would not get the Nested Fault Domains feature as this will require the Enterprise licence – same goes for any future enhancements with regard to Stretch Cluster technology.
It doesn’t seem ideal to me to build a pseudo stretched cluster using the Standard licence, and I am surprised that VMware is not putting tighter restrictions around Fault Domains (i.e. the design goal is that hosts are in racks within metres of each other).
I would have thought there would have been greater differentiation between the two technologies as you can see some customers deploying the cheaper pseudo solution but coming unstuck if they have a link failure.
Many thanks
Mark
Yes – there are no distance requirements to the best of my knowledge. As you have listed, there are many disadvantages to going with that approach, such as needing 3 uniformly configured DCs to implement 3 x FDs, which becomes expensive, or ending up with inaccessible VMs if you decide to implement just 2 x FDs and have a link failure. Not a good idea.