Cloning and Snapshots on vSAN when policy requirements cannot be met

I was looking into some behavior recently to assist one of our partners. He described a situation that they observed during proof-of-concept testing, and I thought it would be worth highlighting this behavior in case you also observe it and are curious as to why it is happening.

Let’s begin with a description of the test. The customer has a 7-node vSAN cluster and has implemented RAID-6 erasure coding for all VMs across the board. The customer isolated one host and, as expected, the VMs continued to run without issue. The customer was also able to clone virtual machines on vSAN and take snapshots on vSAN. No problems there. Next the customer introduced another issue by isolating a second host, which meant that there were now only 5 ESXi hosts running in the cluster. Again, as expected, this did not impact the VMs. They continued to run fine and remained accessible (RAID-6 erasure coding allows VMs to tolerate 2 failures on vSAN). However, when the customer next went ahead and tried to take some snapshots and clones, he encountered the following error:

"an error occurred while taking a snapshot: out of resources"
"There are currently 5 usable fault domains. The operation requires 1 more 
usable fault domains."

Let’s explain why this occurs:

Let’s start with snapshots, as that is easy to explain. Snapshots always inherit the same policy as the parent VMDK. Since the parent VMDK in this situation has a RAID-6 configuration, which requires 6 physical ESXi hosts (4 data segments + 2 parity segments), and there are now only 5 hosts remaining in the cluster, we are not able to create an object with a configuration that adheres to the policy requirement. That’s straightforward enough.
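To make the host-count arithmetic concrete, here is a minimal Python sketch (purely illustrative, not any vSAN API) that maps some common vSAN placement policies to the minimum number of hosts/fault domains each one requires, and checks them against the 5 hosts left in this scenario:

# Illustrative only: minimum host counts for common vSAN placement policies.
# RAID-1 needs 2*FTT+1 hosts (replicas plus witnesses), RAID-5 needs 3+1,
# and RAID-6 needs 4 data + 2 parity = 6, which is why 5 usable hosts fails.
MIN_HOSTS = {
    "RAID-1, FTT=1": 2 * 1 + 1,   # 2 replicas + 1 witness   = 3
    "RAID-1, FTT=2": 2 * 2 + 1,   # 3 replicas + 2 witnesses = 5
    "RAID-5, FTT=1": 3 + 1,       # 3 data + 1 parity = 4
    "RAID-6, FTT=2": 4 + 2,       # 4 data + 2 parity = 6
}

def can_provision(policy: str, usable_hosts: int) -> bool:
    """Can a new object with this policy be placed on the remaining hosts?"""
    return usable_hosts >= MIN_HOSTS[policy]

usable = 5  # 7-node cluster with 2 hosts isolated
for policy, needed in MIN_HOSTS.items():
    verdict = "OK" if can_provision(policy, usable) else "out of resources"
    print(f"{policy}: needs {needed}, have {usable} -> {verdict}")

With 5 usable hosts everything except the RAID-6 policy can still be satisfied, which matches the error message shown above.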

But what about clones? What if we cloned, but we selected a different policy which did not require the same number of physical hosts?

Unfortunately this will not work either. When we clone a running VM on vSAN, we snapshot the VMDKs on the source VM before cloning them to the destination VM. Once again, these snapshots inherit the same policy as the parent VMDK; they do not use the policy of the destination VMDK. For example, here is a VM with 2 VMDKs that I cloned. Both have the vSAN default datastore policy, as does the VM home namespace object. I’m using RVC, the Ruby vSphere Console, available on all vCenter Servers:

/vcsa-06/DC/computers> vsan.vm_object_info Cluster/resourcePool/vms/win-2012-2/
VM win-2012-2:
  Namespace directory
    DOM Object: be71e758-6b8e-d700-23a0-246e962f48f8 (v5, owner: esxi-dell-k.rainpole.com, 
    proxy owner: None, policy: hostFailuresToTolerate = 1, 
    spbmProfileId = aa6d5a82-1c88-45da-85d3-3d74b91a5bad, proportionalCapacity = [0, 100], 
    forceProvisioning = 0, CSN = 85, spbmProfileName = Virtual SAN Default Storage Policy, 
    SCSN = 87, cacheReservation = 0, stripeWidth = 1, spbmProfileGenerationNumber = 0)
      RAID_1
        Component: be71e758-260c-3201-35e6-246e962f48f8 (state: ACTIVE (5), 
        host: esxi-dell-j.rainpole.com, md: naa.500a07510f86d6ae, ssd: naa.55cd2e404c31f8f0,
        votes: 1, usage: 0.5 GB, proxy component: false)
        Component: be71e758-d9b0-3301-14b7-246e962f48f8 (state: ACTIVE (5), 
        host: esxi-dell-k.rainpole.com, md: naa.500a07510f86d6ca, ssd: naa.55cd2e404c31e2c7,
        votes: 1, usage: 0.5 GB, proxy component: false)
      Witness: 52e3ef58-f9db-bf03-5914-246e962f48f8 (state: ACTIVE (5), 
      host: witness-02.rainpole.com, md: mpx.vmhba1:C0:T1:L0, ssd: mpx.vmhba1:C0:T2:L0,
      votes: 1, usage: 0.0 GB, proxy component: false)
  
  Disk backing: [vsanDatastore (1)] be71e758-6b8e-d700-23a0-246e962f48f8/win-2012-2.vmdk
    DOM Object: c071e758-3862-0523-ea3c-246e962f48f8 (v5, owner: esxi-dell-k.rainpole.com, 
    proxy owner: None, policy: hostFailuresToTolerate = 1, 
    spbmProfileId = aa6d5a82-1c88-45da-85d3-3d74b91a5bad, proportionalCapacity = 0, 
    forceProvisioning = 0, CSN = 86, spbmProfileName = Virtual SAN Default Storage Policy, 
    SCSN = 85, cacheReservation = 0, stripeWidth = 1, spbmProfileGenerationNumber = 0)
      RAID_1
        Component: c071e758-9c2e-bd23-b93b-246e962f48f8 (state: ACTIVE (5), 
        host: esxi-dell-j.rainpole.com, md: naa.500a07510f86d684, ssd: naa.55cd2e404c31f8f0,
        votes: 1, usage: 8.6 GB, proxy component: false)
        Component: c2d31259-be67-d9aa-9520-246e962c23f0 (state: ACTIVE (5), 
        host: esxi-dell-l.rainpole.com, md: naa.500a07510f86d6cf, ssd: naa.55cd2e404c31f9a9,
        votes: 1, usage: 8.6 GB, proxy component: false)
      Witness: 6fd41259-7ab9-59e0-ddf8-246e962c23f0 (state: ACTIVE (5), 
      host: witness-02.rainpole.com, md: mpx.vmhba1:C0:T1:L0, ssd: mpx.vmhba1:C0:T2:L0,
      votes: 1, usage: 0.0 GB, proxy component: false)

  Disk backing: [vsanDatastore (1)] be71e758-6b8e-d700-23a0-246e962f48f8/win-2012-2_1.vmdk
    DOM Object: 2072e758-4dd9-ebef-1221-246e962f48f8 (v5, owner: esxi-dell-k.rainpole.com, 
    proxy owner: None, policy: hostFailuresToTolerate = 1, 
    spbmProfileId = aa6d5a82-1c88-45da-85d3-3d74b91a5bad, proportionalCapacity = 0, 
    forceProvisioning = 0, CSN = 80, spbmProfileName = Virtual SAN Default Storage Policy, 
    SCSN = 84, cacheReservation = 0, stripeWidth = 1, spbmProfileGenerationNumber = 0)
      RAID_1
        Component: 2072e758-ce95-11f1-6ab2-246e962f48f8 (state: ACTIVE (5), 
        host: esxi-dell-l.rainpole.com, md: naa.500a07510f86d695, ssd: naa.55cd2e404c31f9a9,
        votes: 1, usage: 40.3 GB, proxy component: false)
        Component: 2072e758-d8e5-12f1-8d83-246e962f48f8 (state: ACTIVE (5), 
        host: esxi-dell-i.rainpole.com, md: naa.500a07510f86d6ab, ssd: naa.55cd2e404c31ef8d,
        votes: 1, usage: 40.3 GB, proxy component: false)
      Witness: 55e3ef58-41dc-6e77-c6d6-246e962f4ab0 (state: ACTIVE (5), 
      host: witness-02.rainpole.com, md: mpx.vmhba1:C0:T1:L0, ssd: mpx.vmhba1:C0:T2:L0,
      votes: 1, usage: 0.0 GB, proxy component: false)
 
Let’s now see what happens to these objects when I clone the VM while it is running:

/vcsa-06/DC/computers> vsan.vm_object_info Cluster/resourcePool/vms/win-2012-2/
VM win-2012-2:
  Namespace directory
    DOM Object: be71e758-6b8e-d700-23a0-246e962f48f8 (v5, owner: esxi-dell-k.rainpole.com, 
    proxy owner: None, policy: CSN = 85, spbmProfileName = Virtual SAN Default Storage Policy, 
    stripeWidth = 1, cacheReservation = 0, hostFailuresToTolerate = 1, 
    spbmProfileId = aa6d5a82-1c88-45da-85d3-3d74b91a5bad, SCSN = 87, forceProvisioning = 0, 
    spbmProfileGenerationNumber = 0, proportionalCapacity = [0, 100])
      RAID_1
        Component: be71e758-260c-3201-35e6-246e962f48f8 (state: ACTIVE (5), 
        host: esxi-dell-j.rainpole.com, md: naa.500a07510f86d6ae, ssd: naa.55cd2e404c31f8f0,
        votes: 1, usage: 0.5 GB, proxy component: false)
        Component: be71e758-d9b0-3301-14b7-246e962f48f8 (state: ACTIVE (5), 
        host: esxi-dell-k.rainpole.com, md: naa.500a07510f86d6ca, ssd: naa.55cd2e404c31e2c7,
        votes: 1, usage: 0.5 GB, proxy component: false)
      Witness: 52e3ef58-f9db-bf03-5914-246e962f48f8 (state: ACTIVE (5), 
      host: witness-02.rainpole.com, md: mpx.vmhba1:C0:T1:L0, ssd: mpx.vmhba1:C0:T2:L0,
      votes: 1, usage: 0.0 GB, proxy component: false)
 
  Disk backing: [vsanDatastore (1)] be71e758-6b8e-d700-23a0-246e962f48f8/win-2012-2-000001.vmdk
    DOM Object: 22331459-f0b3-0049-e0dc-246e962c23f0 (v5, owner: esxi-dell-k.rainpole.com, 
    proxy owner: None, policy: CSN = 1, spbmProfileName = Virtual SAN Default Storage Policy, 
    stripeWidth = 1, cacheReservation = 0, hostFailuresToTolerate = 1, 
    spbmProfileId = aa6d5a82-1c88-45da-85d3-3d74b91a5bad, forceProvisioning = 0, 
    spbmProfileGenerationNumber = 0, proportionalCapacity = [0, 100])
      RAID_1
        Component: 22331459-dfdc-5a49-6955-246e962c23f0 (state: ACTIVE (5), 
        host: esxi-dell-l.rainpole.com, md: naa.500a07510f86d6cf, ssd: naa.55cd2e404c31f9a9,
        votes: 1, usage: 0.3 GB, proxy component: false)
        Component: 22331459-423a-5c49-7f7c-246e962c23f0 (state: ACTIVE (5), 
        host: esxi-dell-i.rainpole.com, md: naa.500a07510f86d6ab, ssd: naa.55cd2e404c31ef8d,
        votes: 1, usage: 0.3 GB, proxy component: false)
      Witness: 22331459-c247-5d49-9389-246e962c23f0 (state: ACTIVE (5), 
      host: witness-02.rainpole.com, md: mpx.vmhba1:C0:T1:L0, ssd: mpx.vmhba1:C0:T2:L0,
      votes: 1, usage: 0.0 GB, proxy component: false)

    Disk backing: [vsanDatastore (1)] be71e758-6b8e-d700-23a0-246e962f48f8/win-2012-2.vmdk
      DOM Object: c071e758-3862-0523-ea3c-246e962f48f8 (v5, owner: esxi-dell-k.rainpole.com, 
      proxy owner: None, policy: CSN = 86, spbmProfileName = Virtual SAN Default Storage Policy, 
      stripeWidth = 1, cacheReservation = 0, hostFailuresToTolerate = 1, 
      spbmProfileId = aa6d5a82-1c88-45da-85d3-3d74b91a5bad, SCSN = 85, forceProvisioning = 0, 
      spbmProfileGenerationNumber = 0, proportionalCapacity = 0)
        RAID_1
          Component: c071e758-9c2e-bd23-b93b-246e962f48f8 (state: ACTIVE (5), 
          host: esxi-dell-j.rainpole.com, md: naa.500a07510f86d684, ssd: naa.55cd2e404c31f8f0,
          votes: 1, usage: 8.6 GB, proxy component: false)
          Component: c2d31259-be67-d9aa-9520-246e962c23f0 (state: ACTIVE (5), 
          host: esxi-dell-l.rainpole.com, md: naa.500a07510f86d6cf, ssd: naa.55cd2e404c31f9a9,
          votes: 1, usage: 8.6 GB, proxy component: false)
        Witness: 6fd41259-7ab9-59e0-ddf8-246e962c23f0 (state: ACTIVE (5), 
        host: witness-02.rainpole.com, md: mpx.vmhba1:C0:T1:L0, ssd: mpx.vmhba1:C0:T2:L0,
        votes: 1, usage: 0.0 GB, proxy component: false)

  Disk backing: [vsanDatastore (1)] be71e758-6b8e-d700-23a0-246e962f48f8/win-2012-2_1-000001.vmdk
    DOM Object: 22331459-117f-016b-0dae-246e962c23f0 (v5, owner: esxi-dell-k.rainpole.com, 
    proxy owner: None, policy: CSN = 1, spbmProfileName = Virtual SAN Default Storage Policy, 
    stripeWidth = 1, cacheReservation = 0, hostFailuresToTolerate = 1, 
    spbmProfileId = aa6d5a82-1c88-45da-85d3-3d74b91a5bad, forceProvisioning = 0, 
    spbmProfileGenerationNumber = 0, proportionalCapacity = [0, 100])
      RAID_1
        Component: 22331459-b4d8-166d-04ac-246e962c23f0 (state: ACTIVE (5), 
        host: esxi-dell-k.rainpole.com, md: naa.500a07510f86d6b8, ssd: naa.55cd2e404c31e2c7,
        votes: 1, usage: 0.0 GB, proxy component: false)
        Component: 22331459-8b5b-186d-dd5e-246e962c23f0 (state: ACTIVE (5), 
        host: esxi-dell-j.rainpole.com, md: naa.500a07510f86d6ae, ssd: naa.55cd2e404c31f8f0,
        votes: 1, usage: 0.0 GB, proxy component: false)
      Witness: 22331459-6887-196d-841b-246e962c23f0 (state: ACTIVE (5), 
      host: witness-02.rainpole.com, md: mpx.vmhba1:C0:T1:L0, ssd: mpx.vmhba1:C0:T2:L0,
      votes: 1, usage: 0.0 GB, proxy component: false)

    Disk backing: [vsanDatastore (1)] be71e758-6b8e-d700-23a0-246e962f48f8/win-2012-2_1.vmdk
      DOM Object: 2072e758-4dd9-ebef-1221-246e962f48f8 (v5, owner: esxi-dell-k.rainpole.com, 
      proxy owner: None, policy: CSN = 80, spbmProfileName = Virtual SAN Default Storage Policy, 
      stripeWidth = 1, cacheReservation = 0, hostFailuresToTolerate = 1, 
      spbmProfileId = aa6d5a82-1c88-45da-85d3-3d74b91a5bad, SCSN = 84, forceProvisioning = 0, 
      spbmProfileGenerationNumber = 0, proportionalCapacity = 0)
        RAID_1
          Component: 2072e758-ce95-11f1-6ab2-246e962f48f8 (state: ACTIVE (5), 
          host: esxi-dell-l.rainpole.com, md: naa.500a07510f86d695, ssd: naa.55cd2e404c31f9a9,
          votes: 1, usage: 40.3 GB, proxy component: false)
          Component: 2072e758-d8e5-12f1-8d83-246e962f48f8 (state: ACTIVE (5), 
          host: esxi-dell-i.rainpole.com, md: naa.500a07510f86d6ab, ssd: naa.55cd2e404c31ef8d,
          votes: 1, usage: 40.3 GB, proxy component: false)
        Witness: 55e3ef58-41dc-6e77-c6d6-246e962f4ab0 (state: ACTIVE (5), 
        host: witness-02.rainpole.com, md: mpx.vmhba1:C0:T1:L0, ssd: mpx.vmhba1:C0:T2:L0,
        votes: 1, usage: 0.0 GB, proxy component: false)

What this shows us is that two additional snapshot objects are added to this VM for the duration of the clone operation, one for each VMDK. Once again, these inherit the same policy as the parent VMDK, so if the VMDKs were deployed as RAID-6 objects, the snapshots used for cloning would also be RAID-6. In the scenario described earlier, a 7-node cluster with 2 host failures, the virtual machines would remain available, but those failures would mean that you could not take a live clone of any of the virtual machines, since live clones rely on snapshots, and snapshots inherit the same policy as their parent VMDKs. Once the clone operation completes, these additional snapshot objects are automatically removed.
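If you want to reason about this before attempting the operation, the important point is that the feasibility of a live clone is decided by the source VMDK policy, not by whatever policy you pick for the destination. Here is a small illustrative Python sketch of that logic (my own pseudo-check under the assumptions above, not how vSAN implements it internally):

# Illustrative only: why a live clone fails even with a "smaller" destination
# policy. The transient snapshot of each source VMDK inherits the SOURCE
# policy, so that is what gets checked against the usable fault domains.
MIN_HOSTS = {"RAID-1, FTT=1": 3, "RAID-5, FTT=1": 4, "RAID-6, FTT=2": 6}

def can_live_clone(source_policies, destination_policy, usable_hosts):
    """Return (ok, reason); source_policies holds one policy per source VMDK."""
    for policy in source_policies:
        # The clone first snapshots the source VMDK; that snapshot object
        # inherits the source policy, regardless of destination_policy.
        if usable_hosts < MIN_HOSTS[policy]:
            missing = MIN_HOSTS[policy] - usable_hosts
            return False, (f"snapshot of source VMDK ({policy}) requires "
                           f"{missing} more usable fault domain(s)")
    # Only if the snapshots can be created does the destination policy matter.
    if usable_hosts < MIN_HOSTS[destination_policy]:
        return False, f"destination policy {destination_policy} cannot be met"
    return True, "clone can proceed"

# 7-node cluster with 2 hosts down; source VMDKs are RAID-6, and the
# destination policy is deliberately the less demanding RAID-1, FTT=1:
print(can_live_clone(["RAID-6, FTT=2", "RAID-6, FTT=2"], "RAID-1, FTT=1", 5))
# -> (False, 'snapshot of source VMDK (RAID-6, FTT=2) requires 1 more usable fault domain(s)')

Even though the destination policy only needs 3 hosts, the check on the source snapshots fails first, which mirrors the "requires 1 more usable fault domains" error at the top of this post.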

Hope this helps to explain the scenario if you also encounter it.