My good pal Paudie and I are back in full customer mode these past few weeks, testing out lots of new and upcoming features in a future release of vSAN. Our testing led us to build a new vSAN stretched cluster, with 5 nodes on the preferred site, 5 nodes on the secondary site, and of course the obligatory witness node. Now, it had been a while since we put a vSAN stretched cluster through its paces. The last time was the vSAN 6.1 release, which led us to create some additional sections on stretched cluster for the vSAN Proof Of Concept guide. You can find the vSAN POC guide on Storage Hub here. Anyway, to cut to the chase, we observed some new behavior which had us scratching our heads a little bit.
In our test, we decided we would isolate the witness from one of the data sites. Here is a sample pic, if you can imagine there being 5 nodes at each data site, rather than 2:
In the past when we did this test, the data site that could still communicate with the witness node formed a majority partition with the witness. If we blocked the witness from the preferred site, then the witness would bind with the secondary site to form a 6-node majority. Conversely, if we blocked the witness from the secondary site, then the witness would bind to the preferred site to form the majority. In this configuration, we would have expected the cluster to partition with a 6:5 node ratio.
This came as a surprise to us. What we now observe is that when we isolate the witness from one of the data sites, the two data sites form a partition together, and the witness becomes isolated. So we end up with a 10:1 partition. It doesn’t matter whether we isolate the witness from the preferred site or from the secondary site; either way we end up with a 10:1 node ratio.
Paudie and I discussed this with one of the main vSAN engineers for stretched clusters, Mukund Srinivasan. Mukund explained that this was a behavior change in vSAN stretched cluster. Now, if the witness is isolated from one of the sites, it does not bind to the other site to form a 6:5 partition split as before. The major change in behavior is that for the witness node to be part of the cluster, it must be able to communicate with the master AND the backup node. In a vSAN stretched cluster, the master node is placed on the opposite data site to the backup node during normal operations. When we isolate the witness from either data site A or data site B, the witness can no longer communicate with both master AND backup, so it is partitioned out. Instead, both data sites remain in the cluster, the witness is isolated, and we end up with a 10:1 partition in our 11-node cluster.
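To make that rule concrete, here is a minimal sketch in Python (purely illustrative, not actual vSAN code) of the witness membership check and the resulting node split in our 5+5+1 cluster:

```python
# Illustrative model of the witness membership rule (not actual vSAN code).
# The witness stays in the cluster only if it can reach BOTH the master and
# the backup node, and those two roles sit on opposite data sites.

def witness_partition(sees_master: bool, sees_backup: bool) -> str:
    """Return the node split for an 11-node (5+5+1) stretched cluster."""
    if sees_master and sees_backup:
        return "11"    # witness reaches both roles: one cluster, no partition
    return "10:1"      # data sites stay together; the witness is isolated

# Cutting the witness off from either data site removes its path to the
# master OR the backup, so the outcome is 10:1 in both cases:
print(witness_partition(sees_master=False, sees_backup=True))   # -> 10:1
print(witness_partition(sees_master=True, sees_backup=False))   # -> 10:1
```

Since master and backup are always on opposite sites in normal operations, isolating the witness from either site is enough to partition it out.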
[Update] This question arose internally, and I thought it would be a good idea to add it to the post. Basically, what happens with a 1+1+1 setup (i.e. a 2-node cluster plus witness)? If one of the data sites fails, then the witness won’t be able to talk to both a master and a backup. What happens then? In a 1+1+1 deployment, if the secondary node goes down, then there is no backup node left in the cluster (just a master and the witness). In that case, the witness will join with the master. Only when there is a backup node that the master can talk to, but the witness cannot, will the witness remain out of the cluster as described above.
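A simplified model of the full rule, including the 1+1+1 exception, might look like this (purely illustrative; the function name is my own, not a vSAN API):

```python
# Hypothetical helper (my naming, not a vSAN API) modelling whether the
# witness stays in the cluster, including the 1+1+1 exception.

def witness_joins_cluster(sees_master: bool,
                          backup_exists: bool,
                          sees_backup: bool) -> bool:
    """Simplified model: does the witness remain part of the cluster?"""
    if not backup_exists:
        # 1+1+1 after the secondary node fails: there is no backup node,
        # so the witness only needs to reach the master.
        return sees_master
    # Normal rule: the witness must reach BOTH the master and the backup.
    return sees_master and sees_backup

# 5+5+1, witness cut off from the backup's site: witness is partitioned out.
assert witness_joins_cluster(sees_master=True, backup_exists=True,
                             sees_backup=False) is False
# 1+1+1, secondary (backup) node down: witness joins the master.
assert witness_joins_cluster(sees_master=True, backup_exists=False,
                             sees_backup=False) is True
```

In other words, the "must reach master AND backup" rule only bites when a backup node actually exists.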
[Update #2] OK, just to make sure we were not seeing things, we deployed a vSAN 6.1 stretched cluster and tested once more. To our surprise, it behaved in the same way as I described above. So it would appear that we have always formed a cluster between the primary and secondary sites if the witness cannot reach both a master AND a backup. I’m not sure what we observed in the past, but I can now confirm that we have always done things this way, and there is nothing new or changed in this behavior. What we need to do now is update our documentation.