Which policy changes can trigger a rebuild on vSAN?

Some time ago, I wrote about which policy changes can trigger a rebuild of an object. This came up again recently, as it was something that Duncan and I covered in our VMworld 2017 session on the top 10 vSAN considerations. In the original post (which is over 3 years old now), I highlighted items like increasing the stripe width, growing the read cache reservation (relevant only to hybrid vSAN) and changing FTT when the read cache reservation is non-zero (again, only relevant to hybrid vSAN), all of which led to a rebuild of the object (or of components within the object). The other policy change that I highlighted was increasing the Object Space Reservation. Because of the queries we received, we did some further testing.

Changing the RAID Protection Mechanism

When I first wrote the article above in 2015, vSAN supported only RAID-1 (mirroring) protection; it did not yet support RAID-5 or RAID-6. If you change a policy from RAID-1 to RAID-5 or RAID-6, or vice-versa, an object rebuild is required. The same is true if you wish to go from RAID-5 to RAID-6, or vice-versa.
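A quick way to confirm that such a change really does rebuild the object is to capture the object layout before and after applying the new policy. A minimal sketch using RVC (the Ruby vSphere Console), assuming you have navigated to the vms folder of the datacenter; the path and VM number here simply mirror the lab output shown later in this post:

/localhost/CH-DC/vms> vsan.vm_object_info 10

The component tree in the output reflects the object's current layout, so comparing the before and after output makes the reconfiguration from one RAID level to another visible.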

Increasing the Object Space Reservation

Testing this on the latest release of vSAN, an object rebuild only takes place when the Object Space Reservation (OSR) of the object is 0 and a new policy with an OSR value greater than 0 is applied. At that point, we are essentially making the object thick rather than thin. Once the object is thick (i.e. it has an OSR value greater than 0), increasing the OSR value further did not initiate a new rebuild. This behaviour may have changed since I last tested it, but it is how the latest release behaves: rebuilds only take place when the OSR is initially 0 and a new non-zero OSR value is applied to the object. The output below is from a test I did in the lab, where we can see the new components created for the non-zero OSR. These are in a state of RECONFIGURING. The current ACTIVE components will be removed when the sync completes, and the RECONFIGURING components will become ACTIVE.

/localhost/CH-DC/vms> vsan.vm_object_info 10
VM centos-swarm-master:
    DOM Object: 5d3f845a-7387-934a-219e-246e962f4910 (v6, owner: esxi-dell-e.rainpole.com, proxy owner: None, policy: CSN = 4, spbmProfileId = 5e72fea5-8391-4677-ba5e-14c357faa109, proportionalCapacity = 20, spbmProfileGenerationNumber = 0, hostFailuresToTolerate = 1, spbmProfileName = OSR=20%)
        Component: 5d3f845a-43c6-9d4b-bf8f-246e962f4910 (state: ACTIVE (5), host: esxi-dell-e.rainpole.com, capacity: naa.500a07510f86d6bb, cache: naa.5001e820026415f0,
                                                         votes: 1, usage: 6.9 GB, proxy component: false)
        Component: 5d3f845a-d161-9f4b-3984-246e962f4910 (state: ACTIVE (5), host: esxi-dell-h.rainpole.com, capacity: naa.500a07510f86d6bf, cache: naa.5001e8200264426c,
                                                         votes: 1, usage: 6.9 GB, proxy component: false)
        Component: 8141845a-50fc-2cce-f012-246e962f4910 (state: RECONFIGURING (10), host: esxi-dell-g.rainpole.com, capacity: naa.500a07510f86d693, cache: naa.5001e82002675164,
                                                         dataToSync: 3.48 GB, votes: 3, usage: 3.7 GB, proxy component: false)
        Component: 8141845a-f925-2fce-f07c-246e962f4910 (state: RECONFIGURING (10), host: esxi-dell-e.rainpole.com, capacity: naa.500a07510f86d685, cache: naa.5001e820026415f0,
                                                         dataToSync: 3.93 GB, votes: 3, usage: 3.3 GB, proxy component: false)
      Witness: 8141845a-53ac-30ce-4899-246e962f4910 (state: ACTIVE (5), host: esxi-dell-h.rainpole.com, capacity: naa.500a07510f86d6bf, cache: naa.5001e8200264426c,
                                                     votes: 3, usage: 0.0 GB, proxy component: false)
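
While the components above sit in RECONFIGURING, the outstanding dataToSync can be watched draining away with RVC's resync dashboard, run against the cluster object. A minimal sketch, assuming the cluster in my lab datacenter is called CH-Cluster (adjust the path and cluster name to your environment):

/localhost/CH-DC/computers> vsan.resync_dashboard CH-Cluster

This reports which objects are syncing and how many bytes remain to sync; once that reaches zero, the RECONFIGURING components become ACTIVE and the old ACTIVE components are discarded.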

Enabling/Disabling Checksum

This is another feature which was not available when I did the initial testing. If checksum is already ENABLED, there is no resync or rebuild activity when it is DISABLED. However, if checksum is DISABLED and a new policy with checksum ENABLED is applied to a VM/VMDK, then a rebuild of the components takes place. Again, I am not sure if this has always been the case, but this is how it behaves in the current release of vSAN.
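To see what an object currently has configured, the policy portion of the vsan.vm_object_info output shown earlier is useful: the DOM Object line lists the object's policy attributes (we saw proportionalCapacity = 20 and hostFailuresToTolerate = 1 above). My recollection is that the checksum setting shows up among these attributes as checksumDisabled once it has been set via policy, but do verify the attribute name on your own build:

/localhost/CH-DC/vms> vsan.vm_object_info 10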

The point of all this is to highlight that policy changes can be made on-the-fly to change your VM's storage requirements, should your applications require it. However, be aware that changing policies on-the-fly requires additional capacity on the vSAN datastore, and can also impact vSAN performance, since such changes can generate a significant amount of rebuild/resync traffic on the vSAN network. This is especially true if you change a policy that impacts many VMs at the same time. I would urge vSAN users to treat policy changes as a maintenance task, and plan accordingly.
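Part of that planning should be a capacity check, since a rebuild temporarily keeps both the old and the new components on disk. A minimal sketch for checking per-disk usage with RVC before making a broad policy change, again assuming a cluster called CH-Cluster:

/localhost/CH-DC/computers> vsan.disks_stats CH-Cluster

This shows the capacity used on each disk in the cluster, which should give you a feel for whether there is enough headroom to carry the transient extra copies created during the resync.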

  1. Could a vmdk expansion start a resync/rebuild process, or would this only happen in some specific situation, such as in the case of objects larger than 255GB?

    Or should this never happen, meaning that if a resync started after a VMDK expansion, it was for another reason, like unbalanced disk utilization?

    • I just did a quick test and grew a 40GB VMDK to 300GB. All that happened was that new components were added to each side of my RAID-1 mirror. The usage on the new components was 0GB, and I observed no rebuild or resync activity. I agree that in a system that was heavily utilized, we might rebalance when a VMDK is grown.

  2. Thank you, Cormac.
    I just asked this because I faced an issue at a customer site recently, and he told me that everything started after a VMDK expansion. Still trying to understand what could have happened…

  3. We experienced the same kind of large resync issue after expanding VMDKs on a couple of very large database VMs. About 3 TB was added to two VMs, and vSAN started a resync that severely throttled the performance of those two VMs. We tried to bring the resync components down from 50 to a lower number, but nothing changed the performance. The resync has been very slow from the start; after 24 hours it was still showing 11 TB of data to resync, and it was about 25 TB when the resync started. VMware support is engaged and actively working on it.
