- Deploy the new witness appliance
- Add the witness appliance as an ESXi host to the vCenter that is managing the VSAN stretched cluster or 2-node VSAN deployment
- Disable the current VSAN stretched cluster/2-node config
- Reconfigure the stretched cluster/2-node config, selecting the same preferred and secondary hosts, and selecting the new witness
- Go to the health check, examine the data health, and click on repair objects immediately to speed up the remediation process. Otherwise, you will have to wait 60 minutes (the default) for the clomd timeout to expire before the absent witness components start rebuilding on the new witness host.
Here are the steps in more detail.
Deploy a new witness appliance
I won’t go into this in any great detail as I already have a blog post on how to do that. The process is almost identical, with a few enhancements made to the overall process in VSAN 6.2. In a VSAN stretched cluster where the witness and physical hosts communicate over L3, static routes must be added so that the physical hosts and the witness host can reach each other. This procedure is covered in the VSAN stretched cluster guide.
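As a rough illustration of the static route step, here is a minimal Python sketch that builds the `esxcli network ip route ipv4 add` command to run on each side. The subnets and gateways are hypothetical placeholders, not values from my environment, and the helper name `static_route_command` is made up for this example. Treat the stretched cluster guide as the authoritative procedure.

```python
import ipaddress


def static_route_command(remote_network: str, gateway: str) -> str:
    """Build the esxcli command that adds an IPv4 static route on an ESXi host."""
    # strict=True rejects a network written with host bits set (e.g. 192.168.110.5/24)
    network = ipaddress.ip_network(remote_network, strict=True)
    gw = ipaddress.ip_address(gateway)
    if network.version != 4 or gw.version != 4:
        raise ValueError("this sketch only handles IPv4")
    if gw in network:
        # The next hop has to be reachable locally, so it cannot sit in the remote network
        raise ValueError("gateway must not be inside the remote network")
    return f"esxcli network ip route ipv4 add --network {network} --gateway {gw}"


# Hypothetical addressing, for illustration only
DATA_SITE_VSAN_NET = "172.16.10.0/24"
DATA_SITE_GATEWAY = "172.16.10.1"
WITNESS_VSAN_NET = "192.168.110.0/24"
WITNESS_GATEWAY = "192.168.110.1"

commands = {
    # Physical hosts need a route to the witness VSAN network
    "physical hosts": static_route_command(WITNESS_VSAN_NET, DATA_SITE_GATEWAY),
    # The witness host needs a route back to the data site VSAN network
    "witness host": static_route_command(DATA_SITE_VSAN_NET, WITNESS_GATEWAY),
}

for target, command in commands.items():
    print(f"{target}: {command}")
```

The sketch only prints the commands. You would still run them on the hosts yourself and then verify connectivity between the VSAN networks.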
Add the witness to the VC managing VSAN
Not an awful lot to add here either. This step is also covered in the blog post mentioned in the previous step. However, you might now see two witnesses in your inventory, the original and the new one. In my case the .19 witness is the newly deployed witness, and .26 is the current one due to be replaced.
Disable the current configuration
To remove the witness appliance, navigate to Cluster > Manage > Virtual SAN > Fault Domains and Stretched Cluster. Note the witness host is .26. On the right-hand side of the window, there is a “Disable” button. Click this, then click on Yes to confirm removing the witness.
Now it is time to rebuild the stretched cluster/2-node configuration and select the new witness. The first step is to re-create the fault domains. This is easy since the original fault domain configuration is still visible, so it is simply a matter of moving the secondary hosts back into the secondary fault domain.
Once the reconfiguration has completed, return to the health check test that we saw earlier. You can choose to leave the configuration as is, and after 60 minutes the objects will automatically repair themselves. However, you are running at higher risk during those 60 minutes (in other words, another failure could leave the VMs inaccessible), so you may want to repair the objects immediately, which is an option offered by the health check. This speeds up the recovery of the absent witness components and recreates them on the new witness host. If you now click retest, you should see the count of objects with reduced availability decrease and the count of healthy objects increase.
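To put a number on that risk window, here is a small Python sketch of the arithmetic. It assumes the default 60 minute repair delay described above. The timestamps and function names are hypothetical, and the result is when the automatic rebuild would begin, not when it would finish.

```python
from datetime import datetime, timedelta

# Default delay before absent components are rebuilt automatically (from the text above)
DEFAULT_REPAIR_DELAY_MINUTES = 60


def automatic_rebuild_start(absent_since: datetime,
                            repair_delay_minutes: int = DEFAULT_REPAIR_DELAY_MINUTES) -> datetime:
    """Return the time at which the automatic rebuild would begin."""
    return absent_since + timedelta(minutes=repair_delay_minutes)


def minutes_still_at_risk(absent_since: datetime, now: datetime,
                          repair_delay_minutes: int = DEFAULT_REPAIR_DELAY_MINUTES) -> int:
    """Return the whole minutes left before the automatic rebuild begins (never negative)."""
    remaining = automatic_rebuild_start(absent_since, repair_delay_minutes) - now
    return max(0, int(remaining.total_seconds() // 60))


# Hypothetical timeline: old witness disabled at 10:00, health check viewed at 10:15
witness_disabled_at = datetime(2016, 3, 1, 10, 0)
checked_at = datetime(2016, 3, 1, 10, 15)

print("Automatic rebuild would start at:", automatic_rebuild_start(witness_disabled_at))
print("Minutes of reduced availability remaining:",
      minutes_still_at_risk(witness_disabled_at, checked_at))
```

In this made-up timeline there are still 45 minutes of reduced availability left, which is the window that repairing the objects immediately avoids.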