Taking vSAN snapshots when there are failures in the cluster
I was discussing the following situation with some of our field staff today. We are aware that snapshots inherit the same policies as the base VMDK, so if I deploy a VM as RAID-6, RAID-5, or RAID-1, its snapshots inherit the same configuration. However, if I have a host failure in a 6-node vSAN cluster running RAID-6 VMs, or a failure in a 4-node vSAN cluster running RAID-5, or in a 3-node vSAN cluster running RAID-1, and I try to take a snapshot, vSAN does not allow me to take the snapshot as there are not enough hosts in the cluster to honour the policy. Here is an example taken from a 4-node cluster with a RAID-5 VM, where I intentionally partitioned one of the nodes and then attempted to take a snapshot. I get a failure similar to the following.
What can I do to address this?
Fortunately, there is a workaround. You can create a new policy which is identical to the original policy of the VM/VMDK, but with one additional setting: Force Provisioning set to Yes. You can then apply this new policy to the VM/VMDK that you wish to snapshot.
The original policy would look something like this:
The new policy would look something like this:
You may now apply this policy to the VM and propagate it to all of its objects. Here is the original RAID-5 policy, and the new policy RAID-5-RP with Force Provisioning selected.
After clicking the “Apply to all” button, the new policy is applied to all the objects:
After the policy has been changed, the VM is still using RAID-5 with one object still absent (due to the failure).
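If you prefer to script this step rather than use the UI, here is a minimal pyVmomi sketch of applying a storage policy to a VM’s home object and all of its virtual disks (the scripted equivalent of “Apply to all”). It assumes you already have a `vim.VirtualMachine` object `vm` and the profile ID of the new Force Provisioning policy; `new_policy_id` is just a placeholder, not something vSAN provides for you.

```python
from pyVmomi import vim

def apply_policy_to_vm(vm, new_policy_id):
    """Apply a storage policy to the VM home object and every virtual disk."""
    profile = [vim.vm.DefinedProfileSpec(profileId=new_policy_id)]

    spec = vim.vm.ConfigSpec()
    spec.vmProfile = profile                      # VM home (namespace) object

    changes = []
    for device in vm.config.hardware.device:
        if isinstance(device, vim.vm.device.VirtualDisk):
            changes.append(vim.vm.device.VirtualDeviceSpec(
                operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
                device=device,
                profile=profile))                 # per-VMDK policy
    spec.deviceChange = changes

    # Returns a vim.Task that can be monitored for completion.
    return vm.ReconfigVM_Task(spec=spec)
```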
With the Force Provisioning policy applied, you can now go ahead and snapshot the VM/VMDK. The snapshot will now be created as a RAID-0 object, not a RAID-5.
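Taking the snapshot itself can also be scripted. A minimal pyVmomi example, assuming the same `vm` object as above; the snapshot name is purely illustrative:

```python
# Take a quiesced snapshot (requires VMware Tools) without capturing memory.
# With Force Provisioning in place, vSAN creates the snapshot delta object
# as RAID-0 until the failure is resolved.
task = vm.CreateSnapshot_Task(name="pre-backup",
                              description="Snapshot taken while the cluster has a failure",
                              memory=False,
                              quiesce=True)
```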
Once the underlying issue has been resolved and the fault has been rectified, the RAID-5 policy can be applied to the snapshot to make it highly available.
So what is the best practice? This seems like a concern: what happens if the failure occurs in the middle of the night? I suspect there would then be snapshot failures in any running backup operations, putting recovery objectives at risk.
You are quite correct. If there is a failure and there are no additional resources for vSAN to self-heal (e.g. a RAID-6 configuration in a 6-node vSAN cluster), then you will not be able to take snapshots. In that case, one would need to change the policy and then re-run the backup, as the snapshots can now be taken.
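If backups run unattended overnight, one option is a small pre-flight check in the backup tooling that compares the number of usable hosts against the minimum the policy needs (4 hosts for RAID-5, 6 for RAID-6, 2n+1 for RAID-1 with FTT=n, as discussed above) and only swaps in the Force Provisioning policy when required. A rough pyVmomi sketch, assuming a `vim.ClusterComputeResource` object `cluster`; the thresholds and function names are illustrative only:

```python
# Minimum host counts needed to place new objects for each layout (per the article).
MIN_HOSTS = {"RAID-1 (FTT=1)": 3, "RAID-5": 4, "RAID-6": 6}

def usable_hosts(cluster):
    """Count hosts that are connected and not in maintenance mode."""
    return sum(1 for h in cluster.host
               if h.runtime.connectionState == "connected"
               and not h.runtime.inMaintenanceMode)

def needs_force_provisioning(cluster, layout="RAID-5"):
    """True if there are no longer enough hosts to honour the policy."""
    return usable_hosts(cluster) < MIN_HOSTS[layout]
```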
This is a little off topic, so I apologize in advance. I have five servers in one vSAN cluster and one distributed switch. I have two disk groups per server, for a total of ten disk groups. I would like the first disk group in each server to contribute to one datastore and the other disk group to a second datastore.
How can I accomplish this?
Not possible, I’m afraid. The vSAN architecture creates a single datastore from all the hosts/disk groups in the same cluster.
Again, a little off topic, sorry, but it may be related.
Can vSAN snapshots help to reduce rebuild times for maintenance activity?
I have an all-SSD cluster with 300 TB on vSAN 6.6, with the single largest VM at 135 TB available to the guest OS. Data resilience on this VM is FTT=2 with erasure coding (RAID-6). The VM has a moderate write rate of about 200 GB per hour, but a very sparse write pattern. A modest 45-minute outage for a host can lead to 10-15 TB of rebuilds, consuming about 8 hours before I can start on my next host. My current strategy involves stopping the application, but I would like to improve the situation. Evacuating a host is not an option due to the size, and I am prepared to go from FTT=2 to FTT=1 for short periods of time.
My idea is to effectively seal the server with a snapshot (as I can do on my storage array) and have the writes directed to the snapshot, which may take less time to rebuild during the maintenance process. Once maintenance is complete, the snapshot would be removed. I have been trying this on a small scale in a nested environment, but my results are inconclusive as I am not certain of the write patterns.
I am trying to assess whether the strategy is worth testing at full scale while containing the risk to my business. Is there any assistance you can provide? I believe this knowledge would be helpful to users of larger clusters when doing maintenance.
It certainly sounds feasible, Antoine, but it is not a procedure that I have heard of anyone trying, nor have I done any testing myself.
I’m assuming a cluster with >= 6 nodes.
You take a snapshot (this will inherit the VMDK policy, so it will also be RAID-6).
You place a host into maintenance mode, choosing no data evacuation. This may also impact the snapshot.
Complete the maintenance task.
Take the host out of maintenance mode.
Allow components to resync. The thinking is that since all I/O goes to the snapshot, the main components should be unaffected, and only the changes captured in the snapshot need to be resynced.
Commit the snapshot (a rough scripted version of this sequence is sketched below).
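Not something I have tested myself, but for reference the steps above map roughly onto these pyVmomi calls (assuming `vm` and `host` objects are already in hand, and that each task is waited on before moving to the next step; the snapshot name is illustrative):

```python
from pyVmomi import vim

# 1. Snapshot the VM (the delta inherits the RAID-6 policy of the base VMDK).
snap_task = vm.CreateSnapshot_Task(name="pre-maintenance", description="",
                                   memory=False, quiesce=True)

# 2. Enter maintenance mode with "No data evacuation"
#    (vSAN decommission mode 'noAction').
mm_spec = vim.host.MaintenanceSpec(
    vsanMode=vim.vsan.host.DecommissionMode(objectAction="noAction"))
enter_task = host.EnterMaintenanceMode_Task(timeout=0,
                                            evacuatePoweredOffVms=False,
                                            maintenanceSpec=mm_spec)

# 3. ...perform the maintenance task...

# 4. Take the host out of maintenance mode and allow components to resync.
exit_task = host.ExitMaintenanceMode_Task(timeout=0)

# 5. Once resync is complete, commit (delete) the current snapshot.
del_task = vm.snapshot.currentSnapshot.RemoveSnapshot_Task(removeChildren=False)
```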
Did I understand the order correctly? Again, I’ve not heard of anyone trying this, nor have I tried it myself. What did you observe in the nested environment that made the results inconclusive?
Your understanding is spot on.
The nested environment is inconclusive as I have not been able to scale it up enough with the correct write patterns to show a realistic result. Indications are that the snaps will be simpler to manage, but I was only dealing with a small number of vSAN objects as opposed to the hundreds of objects in production. Scale, unfortunately, is very significant in this test.
Further to the above, I am carrying out some small tests on the production environment: creating snapshots, running small workloads, and deleting the snaps without putting a host into maintenance mode. As expected, the snaps, being only write-redirect operations, are significantly smaller, but the testing has exposed one more parameter. The write rate needs to be low enough for the snapshot subsystem to manage the consolidation rate back into the base objects when I request a snapshot deletion, or the consolidation appears to “pause” (or make little progress) while writes are occurring.
So if the objects being created for the snapshots correlate directly with the size of the snapshots (and hence lower rebuild times), and the write rate is manageable, there may be a chance. How I calculate “manageable” I am not sure of yet, and I am not aware of any method to tune the snapshot consolidation rate.
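For a rough sense of scale, using the figures quoted earlier in the thread (around 200 GB per hour of writes, a 45-minute host outage, and 10-15 TB of component rebuilds), a quick back-of-envelope comparison of the snapshot delta against a full component rebuild might look like this; the extra hour allowed for resync to catch up is an assumption, not a measurement:

```python
# Back-of-envelope: the data written while the host is away (plus catch-up time)
# is roughly what the snapshot delta has to resync, versus the 10-15 TB
# component rebuild quoted above.
write_rate_gb_per_hour = 200      # figure quoted in the thread
outage_hours = 0.75               # ~45-minute host outage
resync_margin_hours = 1.0         # assumed extra time for resync to catch up

delta_gb = write_rate_gb_per_hour * (outage_hours + resync_margin_hours)
full_rebuild_gb = 12_500          # midpoint of the 10-15 TB quoted above

print(f"Approx. snapshot delta to resync: {delta_gb:.0f} GB")
print(f"Approx. full component rebuild:   {full_rebuild_gb:,} GB")
```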
Hi Cormac,
We are implementing a new vSAN stretched cluster using a 6+6+1 configuration on version 6.6.1. We have evaluated all the failure scenarios and we have a doubt about split-brain situations. We will configure PFTT=1 and SFTT=1. Reading a previous post, we understood that if we have a failure of the inter-site link, all VMs fail over to the preferred site using vSphere HA. If I have a VM running on the secondary site and we lose the inter-site link, what happens to the active data copy on the secondary site? I mean, if HA moves the VMs to the preferred site, we lose data from the secondary-site VMs because the data copy on the preferred site is out of date, right?
In this scenario the witness host is able to communicate with both sites.
Thanks in advance
Hi Marcos, there really is no concept of “active data copy” in vSAN. Any write from a VM to disk is only acknowledged when it is committed to the copy on site A and on site B.
Therefore if HA moves VMs from site A to site B or vice-versa, the VM will always have the latest copy of the data. Replication of data between sites for vSAN can be thought of as synchronous.
Hi Cormac, first of all, thank you very much for your time and for answering my question even though the blog entry is not related. Let me try to explain my question:
– A VM running on HOSTA in the Preferred site (each site has 3 hosts: A, B, C)
– A VM running on HOSTD in the Secondary site (each site has 3 hosts: D, E, F)
– We have an inter-site link failure, but the witness is able to communicate with both sites. As per the documentation, the master HOSTA pings HOSTD for a heartbeat; if it gets no response, it tries HOSTE and so on. If none of the hosts on the secondary site responds, HOSTA designates HOSTB as backup, so HA moves all the VMs to the preferred site.
My question is: what happens to the VMs running on the secondary site during these heartbeat operations? I mean, the VMs running on HOSTA keep running on HOSTA, but after the HA move (we don’t have the inter-site link, so replication is not occurring) the VMs running on HOSTD are moved to HOSTA.
Based on your response, I understand that the VMs running on HOSTD are frozen until HA moves them to HOSTA, because we don’t have replication, right?
Thanks in advance!
At the point where the inter-site link is lost, all VMs will run on the Preferred site. This is because the witness will bind to that site, so objects on that site have quorum. Any VMs that were on the Secondary site are restarted on the Preferred site.
ok, no more questions!
Thanks for your time