Expanding on VSAN 2-node, 3-node and 4-node configuration considerations

I spent the last 10 days at VMware HQ in Palo Alto, and had lots of really interesting conversations and meet-ups, as you might imagine. One of those conversations revolved around the minimum VSAN configurations. Let’s start with the basics.

2-node: There are two physical hosts for data and a witness appliance hosted elsewhere. Data is placed on the physical hosts, and the witness appliance holds the witness components only, never any data.
3-node: There are three physical hosts, and the data and witness components are distributed across all hosts. This configuration can support a number of failures to tolerate (FTT) = 1 with RAID-1 configurations.
4-node: There are four physical hosts, and the data and witness components are distributed across all hosts. This configuration can support FTT = 1 with RAID-1 and RAID-5 configurations. This configuration also allows VSAN to self-heal in the event of a failure, when RAID-1 is used.

Let’s elaborate on that last point. What do we mean by self-healing? What we mean by this is that if a host (or some other infrastructure) fails, and there are free resources available in the cluster, VSAN can automatically fix the issue. With RAID-1 and FTT=1, there are 3 components: first copy of the data, second copy of the data and the witness. These are all located on separate hosts. If one of the hosts fails, then the component on that host can be rebuilt, and the VM is once again fully protected against a failure.

In the case of a 2-node or 3-node cluster, this is not possible. If a host fails, there is no place to rebuild the missing component. We do not place both copies of the data, or the data and a witness, on the same host – that is pointless, since if that host fails we have lost a quorum of components, and thus access to the VM’s object. This is one of the reasons VMware recommends a minimum of 4 nodes for FTT=1, RAID-1. Note, however, that if RAID-5 is chosen, you do not get this self-healing behaviour with 4 nodes. A RAID-5 stripe is made up of 4 components, each placed on a different host, so a 4-node cluster has no spare host on which to rebuild. To have self-healing with RAID-5, you would need a minimum of 5 hosts in the cluster.
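To make the "no place to rebuild" point concrete, here is a minimal Python sketch. It is not how VSAN actually makes placement decisions; it only models the rule described above, that no two components of the same object share a host. The host names (esx-a and so on) and the function name are made up for illustration.

```python
def find_rebuild_host(placement, failed_host, all_hosts):
    """Return a healthy host that holds no component of this object, or None.

    placement maps component name -> host name (illustrative model only).
    """
    used = set(placement.values())  # hosts already holding a component
    for host in all_hosts:
        if host != failed_host and host not in used:
            return host
    return None


# RAID-1, FTT=1: two data copies and a witness, each on its own host.
raid1 = {"copy-1": "esx-a", "copy-2": "esx-b", "witness": "esx-c"}
# RAID-5: four components, each on its own host.
raid5 = {"c1": "esx-a", "c2": "esx-b", "c3": "esx-c", "c4": "esx-d"}

three_node = ["esx-a", "esx-b", "esx-c"]
four_node = three_node + ["esx-d"]
five_node = four_node + ["esx-e"]

print(find_rebuild_host(raid1, "esx-a", three_node))  # None
print(find_rebuild_host(raid1, "esx-a", four_node))   # esx-d
print(find_rebuild_host(raid5, "esx-a", four_node))   # None
print(find_rebuild_host(raid5, "esx-a", five_node))   # esx-e
```

In this model a rebuild target exists only when the cluster has at least one more host than the object has components, which is where the 4-host (RAID-1) and 5-host (RAID-5) minimums come from.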

There is another reason for this recommendation on minimum hosts, and it has to do with maintenance mode. When placing a VSAN node into maintenance mode, you typically choose between “full data migration” and “ensure accessibility”. With “full data migration”, all of the components (data and witnesses) on the host that is being placed into maintenance mode are rebuilt on the remaining nodes. This ensures that your VMs remain fully protected while the maintenance operation is taking place, and that they can survive another failure in the cluster even when a host is in maintenance mode. With FTT=1 and RAID-1, you once again need a minimum of 4 hosts to achieve this. With FTT=1 and RAID-5, you need a minimum of 5 hosts.
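The host-count arithmetic for maintenance mode can be sketched in a few lines of Python. This is only a back-of-the-envelope check under the assumptions in this post (3 components for RAID-1 with FTT=1, 4 for RAID-5); the function and table names are hypothetical, it is not a VSAN API, and it covers clusters of 3 physical hosts and up rather than the 2-node-plus-witness case.

```python
# Components per object with FTT=1, as described in this post.
COMPONENTS_FTT1 = {"RAID-1": 3, "RAID-5": 4}


def maintenance_options(policy, hosts):
    """List the maintenance mode options that can complete for this cluster size."""
    needed = COMPONENTS_FTT1[policy]
    if hosts < needed:
        raise ValueError(f"{policy} needs at least {needed} hosts")
    options = ["ensure accessibility"]
    # Full data migration needs every component to fit on the hosts that
    # remain once one host has entered maintenance mode.
    if hosts - 1 >= needed:
        options.append("full data migration")
    return options


for policy, hosts in [("RAID-1", 3), ("RAID-1", 4), ("RAID-5", 4), ("RAID-5", 5)]:
    print(policy, hosts, maintenance_options(policy, hosts))
```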

The other option, “ensure accessibility”, is the only option that can be used with 2-node and 3-node configurations, since once again there is no place to rebuild the components. In these cases, “ensure accessibility” simply means that at least a single copy of the VM data remains available. For example, a VM deployed with FTT=0 whose component resides on the host being placed into maintenance mode would have that component rebuilt on one of the remaining nodes. For VMs deployed with FTT=1, there should be no need to move any data if “ensure accessibility” is selected. However, you are now running with only 2 out of 3 components, and another failure whilst a host is in maintenance mode can render your VMs inaccessible.
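The risk of running on 2 out of 3 components can be illustrated with a simplified accessibility check. The sketch below assumes each component counts equally, and that an object stays accessible only while more than half of its components, including at least one data copy, are available. Real VSAN quorum handling is more involved; the names here are illustrative only.

```python
def is_accessible(placement, unavailable_hosts):
    """Simplified check: majority of components alive, and one data copy alive."""
    live = [name for name, host in placement.items()
            if host not in unavailable_hosts]
    has_majority = len(live) * 2 > len(placement)
    has_data_copy = any(name.startswith("copy") for name in live)
    return has_majority and has_data_copy


# RAID-1, FTT=1 on a 3-node cluster (hypothetical host names).
raid1 = {"copy-1": "esx-a", "copy-2": "esx-b", "witness": "esx-c"}

# esx-a in maintenance mode: 2 of 3 components remain, VM stays accessible.
print(is_accessible(raid1, {"esx-a"}))           # True
# esx-a in maintenance mode and esx-b fails: only the witness is left.
print(is_accessible(raid1, {"esx-a", "esx-b"}))  # False
# esx-a in maintenance mode and esx-c fails: one data copy, but no majority.
print(is_accessible(raid1, {"esx-a", "esx-c"}))  # False
```

The same check also shows why a data copy and the witness are never co-located: losing that one host would take out 2 of the 3 components at once.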

Hopefully this explains why we make some minimum recommendations on the number of hosts in a VSAN cluster. While you are fully supported with 2-node and 3-node configurations, there are definitely some availability considerations when using these minimum configurations.

8 Replies to “Expanding on VSAN 2-node, 3-node and 4-node configuration considerations”

Carlo Grossman says:

April 28, 2016 at 3:37 pm

What about in a stretched cluster? If you have 4 nodes, but 2 nodes at site A, and 2 nodes at site B, does the same apply? Can you place the witness at site A or site B, or does it need to be at site C?
1. Cormac says:
  
  April 29, 2016 at 2:41 pm
  
  The witness must be on site C in a stretched cluster setup.
K. Chris Nakagaki (@Zsoldier) says:

April 28, 2016 at 3:44 pm

Wouldn’t it make sense to add a witness component option to a 3-node cluster? My assumption being that I could put that witness component in an entirely different datacenter and get the benefits that come w/ a 4-node physical cluster.
1. Cormac says:
  
  April 29, 2016 at 8:40 am
  
  It has been discussed, but I’m not aware of any plans to change it right now.
Neil Cresswell (@NeilC_Cloud) says:

April 28, 2016 at 11:17 pm

Finally some common sense. I have argued with vmw staff for over 6 months on this very issue (see my linkedin post for a start) and was constantly told 2&3 node clusters are perfectly acceptable and no issues around the items you cover above.
1. Cormac says:
  
  April 29, 2016 at 8:46 am
  
  Thanks for the comment, Neil. I guess it all boils down to the level of risk you are prepared to take. I also know that the level of risk may not be understood by everyone, so hopefully this post helps there too.
Darshan Kolambkar says:

April 29, 2016 at 3:02 am

Nice write-up, you cleared my doubts.
Dan Lah (@INDStorage) says:

June 16, 2016 at 6:13 pm

Excellent info as always. I have been having a lot of discussions lately with colleagues that as a partner we are setting ourselves up for major customer satisfaction issues down the road if we recommend 4-node VxRails with all-flash and RAID5 erasure coding. I’m not comfortable with that setup for anything other than remote site and the customer needs to understand the risk. I’d rather have 6 smaller nodes with RAID6 than 4 larger nodes with RAID5. Even a 5 node with RAID5 makes me nervous as a lot can happen during the 60 minute timeout + time it takes to actually evacuate data during the self-healing process.