Taking snapshots with vSAN with failures in the cluster

I was discussing the following situation with some of our field staff today. We are aware that snapshots inherit the same policy as the base VMDK, so if I deploy a VM as RAID-6, RAID-5, or RAID-1, its snapshots inherit the same configuration. However, if I have a host failure in a 6-node vSAN running RAID-6 VMs, in a 4-node vSAN running RAID-5, or in a 3-node vSAN running RAID-1, and I then try to take a snapshot, vSAN does not allow it because there are not enough hosts left in the cluster to honour the policy. The following example is taken from a 4-node cluster with a RAID-5 VM, where I intentionally partitioned one of the nodes and then attempted to take a snapshot. The attempt fails with an error similar to the following.
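The host arithmetic behind these three scenarios can be sketched in a few lines of Python. This is only an illustrative model, not vSAN code: the function name `can_honour_policy` and the lookup table are assumptions made for the example, and the minimum host counts are simply the ones quoted above.

```python
# Minimum host counts taken from the scenarios in this post.
# This table is an illustrative model, not something read from vSAN.
MIN_HOSTS = {"RAID-1": 3, "RAID-5": 4, "RAID-6": 6}


def can_honour_policy(raid_level: str, healthy_hosts: int) -> bool:
    """Return True if enough healthy hosts remain to place a new object."""
    return healthy_hosts >= MIN_HOSTS[raid_level]


# Each cluster below has lost one host, as in the examples above.
for raid_level, cluster_size in [("RAID-6", 6), ("RAID-5", 4), ("RAID-1", 3)]:
    healthy = cluster_size - 1
    ok = can_honour_policy(raid_level, healthy)
    print(f"{raid_level}: {healthy} of {cluster_size} hosts healthy, "
          f"snapshot can honour policy: {ok}")
```

In every case the cluster is already at the minimum size for its policy, so a single host failure leaves no room to place the components of a new snapshot object.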

What can I do to address this?

Fortunately, there is a work-around. One can create a new policy which is identical to the original policy of the VM/VMDK, but with one additional policy setting added: Force Provisioning set to Yes. You can now apply this new policy to the VM/VMDK that you wish to snapshot.
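To make the "identical, plus one setting" idea concrete, here is a minimal Python sketch that models a policy as a plain dictionary. The key names (`failure_tolerance_method`, `force_provisioning`) and the helper `with_force_provisioning` are hypothetical, chosen for illustration; they are not the rule names used by vSAN or any of its APIs.

```python
from copy import deepcopy

# Hypothetical model of the original policy. The key names are
# illustrative only and do not correspond to real vSAN rule names.
original_policy = {
    "name": "RAID-5",
    "rules": {"failure_tolerance_method": "RAID-5"},
}


def with_force_provisioning(policy: dict, new_name: str) -> dict:
    """Return a copy of the policy with Force Provisioning set to Yes."""
    clone = deepcopy(policy)  # leave the original policy untouched
    clone["name"] = new_name
    clone["rules"]["force_provisioning"] = True
    return clone


forced_policy = with_force_provisioning(original_policy, "RAID-5-RP")

# The only rule that differs between the two policies is the added one.
added = set(forced_policy["rules"]) - set(original_policy["rules"])
print(added)
```

The point of copying rather than editing is that the original policy stays available, so it can be re-applied once the cluster is healthy again.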

The original policy would look something like this:

The new policy would look something like this:

You may now apply this policy to the VM, and propagate it to all the objects. Here, the original policy was RAID-5, and the new policy, RAID-5-RP, which has Force Provisioning set, has been selected.

After clicking the “Apply to all” button, the new policy is added to all the objects:

After the policy has been changed, the VM is still using RAID-5 with one object still absent (due to the failure).

With the policy with Force Provisioning set to Yes applied, you can now go ahead and snapshot the VM/VMDK. The snapshot will now be created as a RAID-0 object, not a RAID-5.
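The behaviour described so far can be summarised as a small decision function. This is a simplified sketch of the outcome as described in this post, not of how vSAN makes placement decisions internally; the function name `snapshot_layout` and the host-count table are assumptions for the example.

```python
# Minimum host counts from the scenarios in this post (illustrative model).
MIN_HOSTS = {"RAID-1": 3, "RAID-5": 4, "RAID-6": 6}


def snapshot_layout(policy_raid: str, healthy_hosts: int,
                    force_provisioning: bool = False) -> str:
    """Return the layout a new snapshot object would get in this model."""
    if healthy_hosts >= MIN_HOSTS[policy_raid]:
        # Enough hosts: the snapshot inherits the policy of the base VMDK.
        return policy_raid
    if force_provisioning:
        # Not enough hosts, but Force Provisioning allows a RAID-0 object.
        return "RAID-0"
    # Not enough hosts and no Force Provisioning: the snapshot is refused.
    raise RuntimeError(
        f"cannot honour {policy_raid} with {healthy_hosts} healthy hosts"
    )


# 4-node cluster, RAID-5 VM, one node partitioned (3 healthy hosts).
print(snapshot_layout("RAID-5", 3, force_provisioning=True))
```

Calling the same function with `force_provisioning=False` and three healthy hosts raises the error, which corresponds to the snapshot failure shown at the start of the post.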

Once the underlying issue has been resolved and the fault has been rectified, the snapshot can have the RAID-5 policy applied to make it highly available.

  1. So what is the best practice? This would seem to be a concern because what happens if the failure happens in the middle of the night…I suspect then there would be snapshot failures occurring based on any running backup operations, putting data at risk for recovery objectives.

    • You are quite correct. If there is a failure, and there are no additional resources for vSAN to self-heal (e.g. a RAID-6 configuration in a 6-node vSAN), then you will not be able to take snapshots. In that case, one would need to change the policy and then re-run the backup, since the snapshots can now be taken.

  2. This is a little off topic, so I apologize in advance. I have five servers in one vSAN cluster and one distributed switch. I have two disk groups per server, for a total of ten disk groups. I would like the first disk group in each server to contribute to one datastore and the other disk group to another datastore.

    How can I accomplish this?
