VSAN Part 21 – What is a witness?
At this stage, VSAN has only been in GA for a number of weeks, even though many of us here at VMware have been working on it for a year or two (or even more). Sometimes when we get into explaining the details of storage objects, components, etc., we forget that this is all so new for so many people. In a recent post, someone asked me to explain the concept of a witness on VSAN. Looking back over my posts, I was surprised to realize that I hadn’t already explained it. That is the purpose of this post: to explain what a witness disk is in VSAN and what role it plays.
In order to understand the witness, some familiarity with clustering helps. Many readers will already be familiar with basic clustering technology. If we take Microsoft Clustering (MSCS) as an example, a simple two-node cluster also has the concept of a quorum or witness disk. Each host/node has a vote, as does the witness disk. When one node fails, or there is a split-brain scenario where the nodes continue to run but can no longer communicate with each other, the remaining nodes in the cluster race to place a SCSI reservation on the witness disk. The node which wins now has two votes – its own vote and the witness vote. Therefore it is said to have quorum, and it either continues to provide the clustered service or takes over the running of the service from the node which failed.
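If it helps to see that vote counting written out, here is a minimal sketch of the arithmetic (plain Python, with invented names and numbers; it is not anything from MSCS or VSAN, just an illustration of majority voting):

```python
# Hypothetical illustration of majority voting in a two-node cluster with a
# witness disk. All names and values here are invented for the example.

TOTAL_VOTES = 3  # node1 + node2 + the witness disk each hold one vote

def has_quorum(votes_held):
    # A side of the cluster may keep running only if it holds a majority.
    return votes_held > TOTAL_VOTES / 2

# Split-brain: node1 wins the race for the SCSI reservation on the witness disk.
node1_votes = 1 + 1  # its own vote plus the witness vote
node2_votes = 1      # only its own vote

print(has_quorum(node1_votes))  # True  -> node1 keeps (or takes over) the service
print(has_quorum(node2_votes))  # False -> node2 must stop offering the service
```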
So let's talk about witness disks in the context of VSAN. There are some similarities, but there are also some major differences from what we discussed in the previous paragraph. The first thing to highlight is that VSAN witness disks are used to provide an availability mechanism for virtual machines – they are not there to provide an availability mechanism for the ESXi hosts/nodes participating in the VSAN cluster. This is a misconception I have come across in the past.
A second thing to highlight is that greater than 50% of the components which go to make up a virtual machine's storage object must be available in order for that object to remain accessible. If 50% or fewer of the components of an object are available across all the nodes in a VSAN cluster, that object is no longer accessible on the VSAN datastore. Witnesses play an important role in ensuring that more than 50% of the components of an object remain available. You can read more about components and objects in this earlier post.
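To make the greater-than-50% rule concrete, here is a small sketch of the accessibility check as just described (plain Python again, with a hypothetical helper name; VSAN obviously does not expose anything like this):

```python
def object_accessible(components_available, components_total):
    # An object stays accessible only while strictly more than half of its
    # components can be reached (the rule described above).
    return components_available > components_total / 2

# Example: an object made up of 3 components.
print(object_accessible(2, 3))  # True  -> one component lost, still accessible
print(object_accessible(1, 3))  # False -> a single surviving component is not enough
```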
Now let's take a simple virtual machine deployed on a VSAN datastore. Even if we do not create a policy and simply use the default policy, we get a Number of Failures To Tolerate (FTT) = 1 capability for the virtual machine disks (VMDKs) deployed on the VSAN datastore. What this means is that the VMDK will have two copies/replicas created, each copy located on a different ESXi host, so that a single failure in the cluster still leaves a copy of the data available. Each of these copies or replicas is a component of the VMDK storage object. Now there are two questions to ask: first, how does VSAN handle split-brain/network partitioning, and second, how can we ensure that more than 50% of the components of this VMDK object remain available when a host or disk does fail? This is the role of the witness.
As well as creating two replica copies of the VMDK, a third component of the object is also created. This is the witness disk. It is purely metadata, and it consumes only 2MB of disk space. Now, in a 3-node cluster, if the replicas of the VMDK are placed on host1 and host2, the witness is placed on host3. This means that if any single host fails, we still have a copy of the data, and we still have greater than 50% of the components available. If there is a network partition or split brain, with two nodes on one side of the partition and one node on the other, there will again be one side that holds greater than 50% of the components. Here is a representation of a VMDK (hard disk 1) on a VM which has Number of Failures To Tolerate set to 1.
There are three components in total: two replicas and the witness. All three components are placed on different hosts in the cluster.
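Putting that layout together with the accessibility rule, the following short sketch (hypothetical host and component names, same caveats as above) walks through what each single host failure, and the 2+1 network partition, means for this object:

```python
# Hypothetical FTT=1 layout from the example: three components on three hosts.
layout = {
    "host1": "replica-1",
    "host2": "replica-2",
    "host3": "witness",
}
TOTAL = len(layout)  # 3 components

def accessible(surviving_hosts):
    # The object stays accessible while more than half of its components
    # remain reachable from one side of the cluster.
    available = sum(1 for host in layout if host in surviving_hosts)
    return available > TOTAL / 2

# Any single host failure: 2 of 3 components remain -> still accessible.
for failed in layout:
    remaining = set(layout) - {failed}
    print(f"{failed} fails -> accessible: {accessible(remaining)}")

# Split brain: host1 and host3 on one side, host2 isolated on the other.
print(accessible({"host1", "host3"}))  # True  -> replica + witness hold the majority
print(accessible({"host2"}))           # False -> a lone replica cannot win
```

The side of the partition holding a replica plus the witness has two of the three components, so the object stays accessible there; the isolated replica does not.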
Now this is the simplest example. Many of you who use the StripeWidth capability, or who set FTT to a value greater than 1, will have observed numerous additional witnesses being created. The point to keep in mind is that the number of witnesses, and how they are distributed, will vary based on the capability requirements placed in a virtual machine's VM storage policy.
Thanks for the explanation. I agree that this part has not been explained enough.
Would love to see a follow-up with more examples of witness behavior when FTT numbers are larger than 1. Duncan has a good post on http://www.yellow-bricks.com/2013/10/24/4-minimum-number-hosts-vsan-ask/ that you perhaps could expand on.
I’ll see what I can do. The problem is that the number of witnesses can change depending on the policy, so it would be impossible to capture all possible combinations.
I have some questions:
– Why does FTT = 1 give 1 witness, but FTT = 1 and Stripe = 2 give 3 witnesses?
– If FTT = 1 and Stripe = 2, a replica of the VM is striped across host 1 and host 2. Is the stripe on host 1 a component on its own, or do the stripes on host 1 and host 2 together make up one component?
– What happens when the host containing the witness fails?
– Does the witness contain information about the VM and its replicas, and where they are placed?
– When a host fails, will the VM be restarted on another host? Is that handled by HA?
Phew – lots of questions. I will probably need another blog post to answer these, as they would take too much space to explain here.
If the host containing the witness fails, nothing happens. You still have a full set of data, and greater than 50% of the components. If it is a host failure, and the failure lasts longer than 60 minutes, the witness will be automatically rebuilt.
When a host fails, yes, HA can restart VMs on other hosts in the VSAN cluster.
However, with two replicas there is no way to differentiate between a network partition and a host failure (from http://www.vmware.com/files/pdf/products/vsan/VMware_Virtual_SAN_Whats_New.pdf).
How does the witness in VSAN differentiate between a network partition and a host failure?