vSAN Stretched Cluster – Partition behavior changes
My good pal Paudie and I are back in full customer[0] mode these past few weeks, testing out lots of new and upcoming features in a future release of vSAN. Our testing led us to build a new vSAN stretched cluster, with 5 nodes on the preferred site, 5 nodes on the secondary site, and of course the obligatory witness node. Now, it had been a while since we put vSAN stretched cluster through its paces. The last time was the vSAN 6.1 release, which led us to create some additional sections on stretched cluster for the vSAN Proof Of Concept guide. You can find the vSAN POC guide on Storage Hub here. Anyway, to cut to the chase, we observed some new behavior which had us scratching our heads a little bit.
In our test, we decided we would isolate the witness from one of the data sites. Here is a sample pic, if you can imagine there being 5 nodes at each data site, rather than 2:
In the past when we did this test, the data site that could still communicate with the witness node formed a majority partition with the witness. If we blocked the witness from the preferred site, the witness would bind with the secondary site to form a 6-node majority. Conversely, if we blocked the witness from the secondary site, the witness would bind to the preferred site to form the majority. In this configuration, we would have expected the cluster to partition with a 6:5 node ratio.
What we actually observed came as a surprise to us. When we now isolate the witness from one of the data sites, the two data sites form a partition and the witness becomes isolated, so we end up with a 10:1 partition. It doesn’t matter whether we isolate the witness from the preferred site or from the secondary site; either way we end up with a 10:1 node ratio.
Paudie and I discussed this with one of the main vSAN engineers for stretched clusters, Mukund Srinivasan. Mukund explained that this was a behavior change in vSAN stretched cluster. Now, if the witness is isolated from one of the sites, it does not bind to the other site to form a 6:5 partition split as before. The major change in behavior is that for the witness node to be part of the cluster, it must be able to communicate with both the master AND the backup node. In vSAN stretched clusters, the master node is placed on the opposite data site to the backup node during normal operations. When we isolate the witness from either data site A or data site B, the witness can no longer communicate with both master AND backup, so it is partitioned out. Instead, both data sites remain in the cluster and the witness is isolated, leaving us with a 10:1 partition in our 11-node cluster.
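To make the rule concrete, here is a quick toy sketch (illustrative Python only, not vSAN code; the function name and parameters are made up for the example) of how the membership decision plays out in our 5+5+1 cluster:

```python
# Toy model of the rule described above: the witness only stays in the cluster
# partition when it can reach BOTH the master and the backup node. With the
# master on one data site and the backup on the other, cutting the witness off
# from either site isolates it and leaves the ten data nodes together.

def partition_sizes(witness_reaches_preferred, witness_reaches_secondary,
                    nodes_per_site=5):
    """Return (data-partition size, witness-partition size), assuming the
    inter-site link is healthy and master/backup sit on opposite sites."""
    data_nodes = 2 * nodes_per_site
    if witness_reaches_preferred and witness_reaches_secondary:
        return (data_nodes + 1, 0)   # one healthy 11-node cluster
    return (data_nodes, 1)           # witness cannot reach master AND backup: 10:1

# Isolating the witness from either data site gives a 10:1 split, never 6:5.
print(partition_sizes(witness_reaches_preferred=False, witness_reaches_secondary=True))   # (10, 1)
print(partition_sizes(witness_reaches_preferred=True, witness_reaches_secondary=False))   # (10, 1)
```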
[Update] This question arose internally, and I thought it would be a good idea to add it to the post. Basically, what happens in a 1+1+1 setup (i.e. a 2-node cluster) when one of the data sites fails and the witness can no longer talk to both a master and a backup? In a 1+1+1 deployment, if the secondary node goes down, there is no backup node left in the cluster (just a master and the witness). In that case, the witness will join with the master. Only in cases where there is a backup node that the master node can talk to, but the witness cannot, will the witness remain out of the cluster as described above.
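The same toy sketch, extended for the 1+1+1 case (again, illustrative Python only; names are made up for the example):

```python
# Toy model of the 1+1+1 (two-node) case: if the secondary node is down there
# is no backup node, so the witness only needs to reach the master to stay in
# the cluster; otherwise the normal "master AND backup" rule applies.

def witness_stays_in_cluster(reaches_master, reaches_backup, backup_exists):
    if not backup_exists:
        return reaches_master                     # 1+1+1 with the secondary down
    return reaches_master and reaches_backup      # normal stretched-cluster rule

print(witness_stays_in_cluster(reaches_master=True, reaches_backup=False, backup_exists=False))  # True: witness joins the master
print(witness_stays_in_cluster(reaches_master=True, reaches_backup=False, backup_exists=True))   # False: witness stays out
```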
[Update #2] OK – just to make sure we were not seeing things, we deployed a 6.1 vSAN stretched cluster and tested once more. To our surprise, this also worked in the same way as described above. So it would appear that we have always formed a cluster between the primary and secondary sites if the witness cannot reach both a master AND a backup. I’m not sure what we observed in the past, but I can now confirm that we have always done things this way, and there is nothing new or changed in this behaviour. What we need to do now is update our documentation.
I’m not into vSAN, but this sounds like a bug fix that’s finally been implemented.
If you have more than 3 quorum/voting nodes (as you should, in a large cluster like that), it is normal and expected that one of them should be kicked out if it misbehaves. That certainly shouldn’t take down half the site, as it did before.
No – not a bug. It was working as designed.
But I agree with you – the new behaviour is a big improvement.
Hi, might this generate a lot of data movement? Does the fault domain remain functional?
What about the metadata?
Not really. The original design/behaviour would have caused data movement. With this new behaviour, only the witness is isolated, and its witness components (which are quite tiny) are never migrated (they always stay on the witness appliance). These may need to be updated when the witness appliance issue is resolved, but even deploying a new witness appliance to address an outage should result in minimal rebuild time.
Interesting, thx for sharing.
OK, but how do you create new objects that satisfy the policy if FD3 is missing? Where is the witness object created if FD3 (the witness site) is missing? Must we force the creation of the object (RAID-0)?
Yes – you have a number of options. You could use a policy that sets FTT=0. You could also use force-provisioning in the policy to roll out a VM even when the policy requirements can’t be met. Ideally though, you would fix the FD3 issue before deploying new VMs, right? 😉
Hi Cormac, thanks for sharing! Can you tell us the vSAN version in which the behavior was modified?
Hi Cédric – my understanding is that it appeared in vSAN 6.2.
Something to verify: I have observed the previous behavior with vSAN 6.2. We are on 6.2. I can’t retest the scenario again, because we only have a production environment at the moment.
OK – I’m not in a position to test it either at the moment, but I understood it went into 6.2. Perhaps it is 6.5 then. I will have to check/test it at some point, I guess. Thanks for clarifying, Marco.
OK – so I just tested it on 6.2, and it has the new behaviour as far as I can see. After isolating the witness from one of the sites (by removing the static routes), the nodes in each of the data sites stayed in the cluster, and the witness was left isolated.
Thank you for this article, Cormac. I observed your first scenario last week in our environment, and I thought our network was misconfigured… good to know.
Thanks Marco – glad it helped. We’re working on getting this new behaviour added to the vSAN POC (proof of concept) guide.