VSAN 6.2 Part 9 – Replacing the witness appliance

Cormac

8 years ago

There might be a reason in VSAN stretched cluster environments or in 2-node VSAN ROBO deployments to change the witness appliance. The one thing to keep in mind is that you must use a witness appliance that has the same on-disk format as the rest of the disk groups in the cluster. Right now, there is a 6.1 version of the appliance and a 6.2 version of the appliance, so make sure that you select the correct one. Replacing the current witness with a new witness is very straightforward, and the tasks can be summarized as follows:

Deploy the new witness appliance
Add the witness appliance as an ESXi host to the vCenter that is managing the VSAN stretched cluster or 2-node VSAN deployment
Disable the current VSAN stretched cluster/2-node config
Reconfigure the stretched cluster/2-node config, selecting the same preferred and secondary hosts, and selecting the new witness
Go to the health check, examine the data health and click on "repair objects immediately" to speed up the remediation process. Otherwise, you will have to wait 60 minutes (the default) for the clomd timeout to expire before the absent witness components are rebuilt on the new witness host.
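To make the 60-minute wait in the last step concrete, here is a minimal Python sketch of the arithmetic. It assumes the default delay quoted above and a known time at which the old witness was removed. The function names and timestamps are hypothetical, and nothing here talks to VSAN or vCenter.

```python
from datetime import datetime, timedelta

# Default clomd repair delay quoted in the text (minutes).
DEFAULT_REPAIR_DELAY_MINUTES = 60


def automatic_rebuild_time(witness_removed_at, delay_minutes=DEFAULT_REPAIR_DELAY_MINUTES):
    """Earliest time the absent witness components would be rebuilt automatically."""
    return witness_removed_at + timedelta(minutes=delay_minutes)


def minutes_until_rebuild(witness_removed_at, now, delay_minutes=DEFAULT_REPAIR_DELAY_MINUTES):
    """Whole minutes left before the automatic rebuild starts (0 once the delay has passed)."""
    remaining = automatic_rebuild_time(witness_removed_at, delay_minutes) - now
    return max(0, int(remaining.total_seconds() // 60))


# Hypothetical timestamps: witness removed at 10:00, checked again at 10:25.
removed = datetime(2016, 3, 1, 10, 0)
checked = datetime(2016, 3, 1, 10, 25)
print(automatic_rebuild_time(removed))
print(minutes_until_rebuild(removed, checked))
```

For the whole of that window the objects are running without their witness components, which is why the manual repair option is worth using.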

Here are the steps in more detail.

Deploy a new witness appliance

I won’t go into this in any great detail as I already have a blog post on how to do that. The process is almost identical, with a few enhancements made to the overall process in VSAN 6.2. In a VSAN stretched cluster, where the witness and physical hosts communicate over L3, static routes will need to be added to allow the physical hosts and the witness host to communicate. This procedure is covered in the VSAN stretched cluster guide.
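As a rough illustration of the static route step, the Python sketch below validates a network and gateway and builds the corresponding esxcli command string. It only prints the command; it does not run anything on a host. The addresses are hypothetical placeholders, so substitute the values for your own witness and data networks and check them against the stretched cluster guide.

```python
import ipaddress


def static_route_command(remote_network, gateway):
    """Build the esxcli command that adds an IPv4 static route.

    Raises ValueError if the network is not a valid IPv4 CIDR
    (for example, host bits set) or the gateway is not a valid IPv4 address.
    """
    net = ipaddress.IPv4Network(remote_network)
    gw = ipaddress.IPv4Address(gateway)
    return f"esxcli network ip route ipv4 add --network {net} --gateway {gw}"


# Hypothetical example: on a physical host, reach the witness network
# 192.168.110.0/24 via the local gateway 192.168.15.1.
print(static_route_command("192.168.110.0/24", "192.168.15.1"))
```

A matching route in the opposite direction would be needed on the witness host so that traffic can return to the physical hosts.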

Add the witness to the VC managing VSAN

Not an awful lot to add here either. This step is also covered in the blog post mentioned in the previous step. However, you might now see two witnesses in your inventory, the original and the new one. In my case the .19 witness is the newly deployed witness, and .26 is the current one due to be replaced.

Disable the current configuration

To remove the witness appliance, navigate to Cluster > Manage > Virtual SAN > Fault Domains and Stretched Cluster. Note the witness host is .26. On the right-hand side of the window, there is a “Disable” button. Click this, then click on Yes to confirm removing the witness.

At this point, once the witness has been removed, all of the witness components are absent, so if any objects are queried, or the health check is examined, the data check should show this failure:

Reconfigure with new witness

Now it is time to rebuild the stretched cluster/2-node configuration, and select a new witness. The first step is to re-create the fault domains. This is easy since the original fault domain configuration is still visible, so it is simply a matter of moving the secondary hosts back into the secondary fault domain:

Now the next step is to select the new witness. This time I select the .19 witness host, and not the .26 witness host. The rest of the steps are the same as setting up the witness for the first time, such as creating a disk group, etc.
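Before committing to the new witness, it is worth re-checking the on-disk format requirement from the introduction. The Python sketch below expresses that check over hypothetical data. The version labels simply reuse the 6.1 and 6.2 appliance names from this post, and the function does not query VSAN in any way.

```python
def witness_format_matches(witness_format, disk_group_formats):
    """Return True only if every disk group uses the same format as the witness.

    An empty list of disk groups is treated as a failed check, since there
    is nothing to compare against.
    """
    return bool(disk_group_formats) and all(
        fmt == witness_format for fmt in disk_group_formats
    )


# Hypothetical inventory: all physical disk groups are at the 6.2 format.
cluster_disk_groups = ["6.2", "6.2", "6.2", "6.2"]

print(witness_format_matches("6.2", cluster_disk_groups))  # 6.2 appliance
print(witness_format_matches("6.1", cluster_disk_groups))  # 6.1 appliance
```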

Repair Objects Immediately

After the configuration has completed, return to the health check test that we saw earlier. You can choose to leave the configuration as is, and after 60 minutes, the objects will automatically repair themselves. However, you are running a higher risk for these 60 minutes (in other words, another failure could leave the VMs inaccessible), so you may want to repair the objects immediately, which is an option offered by the health check. This will speed up the recovery of the absent witness components and recreate them on the new witness host. Now if you click retest you should see the count for objects with reduced availability decrease and the count for healthy objects increase:

When all objects show up as healthy, the replacement process can be considered complete.
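The completion test can be sketched as a small Python function over the object counts read from the health check. The dictionary keys and numbers below are hypothetical labels for illustration, not the exact category names or an API of the health check.

```python
def replacement_complete(object_counts):
    """Return True when every object is counted as healthy.

    object_counts maps a health category label to a number of objects.
    An empty or all-zero summary is treated as not complete.
    """
    total = sum(object_counts.values())
    return total > 0 and object_counts.get("healthy", 0) == total


# Hypothetical readings taken from successive retests of the health check.
during_repair = {"healthy": 10, "reduced-availability": 3}
after_repair = {"healthy": 13, "reduced-availability": 0}

print(replacement_complete(during_repair))
print(replacement_complete(after_repair))
```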