2-node vSAN – witness network design considerations

Cormac

7 years ago

It seems that 2-node vSAN deployments for ROBO (remote office/branch office) sites are becoming more and more popular. The fact that one can now connect the 2 vSAN hosts at the remote office directly back-to-back, without needing a 10Gb switch, has reduced the cost considerably. And with the introduction of the vSAN Enterprise for ROBO license edition in vSAN 6.6.1, you get the full feature set of vSAN on 2-node deployments. This new edition builds on the vSAN Advanced edition, and enables the use of features like native encryption and stretched clusters on a per-VM pricing model for smaller sites.

The purpose of this post is to address some of the more common questions that we get these days about 2-node deployments, namely those about networking between the remote site and the main DC back at HQ. In most cases, the main DC will host both the vCenter Server that manages the vSAN cluster and the witness appliance, so network connectivity is needed between HQ and the remote site for both the vSAN network and the management network.

These days, 2-node ROBO implementations typically separate the witness traffic from the vSAN data traffic. This means that the vSAN data only flows over the direct connect VMkernel interfaces between the 2 vSAN nodes at the remote site, while the witness traffic (which is minimal) is routed via another interface back to the witness appliance residing in the main DC at HQ. This is known as WTS, short for Witness Traffic Separation, and is discussed in more detail in the vSAN Stretched Cluster and 2-node guide. Duncan has already written an article answering another common question: whether the witness traffic from all remote sites is able to share the same VLAN between the remote sites and HQ if required (the answer is yes).

Duncan also makes another point that is important to know for ROBO deployments and witness traffic: ROBO locations must send the witness traffic over L3. Another thing to note is that no multicast is used for witness traffic; it has always been unicast. So there is no need to worry about routing multicast with PIM, etc.

In the layout diagram taken from his post, all of the remote sites have unique VLANs for both the management network and the witness network. These networks are L3, and require static routes to be created.
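To make the static route requirement a little more concrete, here is a minimal Python sketch that works out which route each remote site would need in order to reach the witness network. Everything in it is an assumption made for illustration: the site names, the IP addresses, the `RemoteSite` class and the helper functions are all hypothetical, and the `esxcli` string that it prints should be checked against the documentation for your ESXi version before use.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteSite:
    name: str
    witness_vmk: str  # hypothetical IP/prefix of the host interface carrying witness traffic
    gateway: str      # hypothetical next hop on that site's witness VLAN


def plan_static_routes(sites, witness_appliance):
    """Return (site name, destination network, gateway) for every remote site."""
    witness_net = ipaddress.ip_interface(witness_appliance).network
    routes = []
    for site in sites:
        local_net = ipaddress.ip_interface(site.witness_vmk).network
        # Witness traffic must be routed (L3), so the two subnets must differ.
        if local_net.overlaps(witness_net):
            raise ValueError(f"{site.name}: host and witness share a subnet")
        # The next hop has to be reachable on the local witness VLAN.
        if ipaddress.ip_address(site.gateway) not in local_net:
            raise ValueError(f"{site.name}: gateway is not on the local subnet")
        routes.append((site.name, str(witness_net), site.gateway))
    return routes


def as_esxcli(network, gateway):
    """Format one route as an ESXi static route command (verify before use)."""
    return f"esxcli network ip route ipv4 add --network {network} --gateway {gateway}"


# Made-up addressing: one unique witness subnet per remote site.
sites = [
    RemoteSite("site-a", "192.168.15.11/24", "192.168.15.1"),
    RemoteSite("site-b", "192.168.25.11/24", "192.168.25.1"),
]
routes = plan_static_routes(sites, "192.168.110.20/24")
for name, network, gateway in routes:
    print(name, "->", as_esxcli(network, gateway))
```

Keep in mind that routing has to work in both directions, so the witness appliance at HQ would need equivalent routes back to each remote site's witness subnet.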

Another setup may be where the management network is part of a stretched L2 network that can reach the remote sites. It’s quite unlikely that this will be the case when you have a lot of remote sites, but you may come across it when there is only one, or a very small number, of remote sites. If you have multiple subnets routed to each site for the witness traffic, you may have a configuration which looks like this.

Similarly, if the remote sites have a single routed subnet, the witness networks may all have to share the same VLAN once more, as Duncan mentioned in his post referenced earlier. This may look something like the following:

Of course, here I have drawn the management networks with their own separate VLANs per site. But as in the previous topologies, the management network could again be a stretched L2.

Now there is another scenario that we get quite a few questions on, and this is where customers may only have a single subnet routed between the main DC and the remote sites. In this case, the management VMkernel interface of the hosts may also be tagged for witness traffic, which is very light anyway. So in this example, the witness traffic and the management traffic share the same VLAN, and only one subnet is routed. Back at the main DC, the witness appliance may be modified so that a single interface can be used for both management and witness traffic. This is quite straightforward to do, and means that the original portgroup created at deployment time for vSAN traffic on the witness appliance may be removed.
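As a rough illustration of the single-subnet case, the following Python sketch checks that each host's management address sits in the one routed subnet, and then produces the command that would tag the management VMkernel interface for witness traffic. The host names, addresses, the `vmk0` default and the `witness_tag_commands` helper are all hypothetical. The `esxcli vsan network ip add -i <vmk> -T=witness` form is the one used for Witness Traffic Separation, but do verify it against the 2-node guide for your vSAN version.

```python
import ipaddress


def witness_tag_commands(hosts, routed_subnet, vmk="vmk0"):
    """Map each host name to the command that tags its management vmk for witness traffic.

    hosts maps a host name to its management IP address (made-up values below).
    """
    subnet = ipaddress.ip_network(routed_subnet)
    commands = {}
    for name, mgmt_ip in hosts.items():
        # Only one subnet is routed to HQ, so the shared interface must live in it.
        if ipaddress.ip_address(mgmt_ip) not in subnet:
            raise ValueError(f"{name}: {mgmt_ip} is outside routed subnet {subnet}")
        commands[name] = f"esxcli vsan network ip add -i {vmk} -T=witness"
    return commands


# Made-up remote site with two hosts sharing one routed management subnet.
remote_hosts = {"esxi-robo-01": "10.10.20.11", "esxi-robo-02": "10.10.20.12"}
tag_commands = witness_tag_commands(remote_hosts, "10.10.20.0/24")
for host, command in tag_commands.items():
    print(host, "->", command)
```

The check is deliberately simple. It only captures the idea that, with a single routed subnet, the interface carrying both management and witness traffic has to be in that subnet.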

We have spoken about this to our engineering and product management teams to make sure there are no support issues around this approach, and they have confirmed to us that there are no issues with this design, should a customer wish to implement it. We are now updating our official docs to call this out as a supported configuration for 2-node / ROBO vSAN deployments.