vSAN Capacity Management in v7.0U1

With the release of vSAN 7.0U1, a major change was made with regard to what was termed “slack space” requirements. This referred to how much space should be set aside on the vSAN datastore for operational and rebuild purposes. I have had a few queries about this recently, so I thought I would take the opportunity to highlight some of the capacity management features now available in vSAN. This is also a good time to revisit the advanced options for Automatic Rebalance, and to discuss the Reactive Rebalance feature that we have had in vSAN for some time now. Let’s start with the Rebalance features, and then we can talk about the new capacity reservations that we set aside both for rebuild operations after a failure and for transient operations such as policy changes.

Automatic Rebalance

Automatic Rebalance is an Advanced Option and relates to the even distribution of vSAN object components. Components are distributed around the vSAN cluster to various hosts, disk groups and physical disks. If this Advanced Option is enabled, and the algorithms associated with Automatic Rebalance determine that the system is unbalanced, vSAN will initiate a rebalancing operation automatically. The threshold to initiate rebalance is set to 30% by default, which means that if any two disks have this variance (one is 30% more loaded than the other), rebalancing of components begins. Rebalancing will continue until the variance reaches half of the set threshold value, i.e. 15% by default (or until Automatic Rebalance is disabled). However, the rebalancing threshold can be set to anywhere between 20% and 75% variance, depending on how aggressively you wish to balance your cluster capacity. The last Advanced Option below shows how to enable and set the threshold.
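To make the start and stop conditions concrete, here is a minimal Python sketch of the threshold logic described above. It is not vSAN's actual algorithm. The function names are hypothetical, and it assumes that "variance" means the percentage-point gap between the most-used and least-used disk.

```python
# Illustrative sketch only; function names are hypothetical, not vSAN APIs.
# Assumption: "variance" is the percentage-point gap between the fullest
# and the emptiest disk in the cluster.

MIN_THRESHOLD_PCT = 20.0
MAX_THRESHOLD_PCT = 75.0


def usage_variance(disk_usage_pct):
    """Gap between the most-used and least-used disk, in percentage points."""
    return max(disk_usage_pct) - min(disk_usage_pct)


def should_start_rebalance(disk_usage_pct, threshold_pct=30.0):
    """True when the variance exceeds the configured threshold."""
    if not MIN_THRESHOLD_PCT <= threshold_pct <= MAX_THRESHOLD_PCT:
        raise ValueError("threshold must be between 20% and 75%")
    return usage_variance(disk_usage_pct) > threshold_pct


def should_stop_rebalance(disk_usage_pct, threshold_pct=30.0):
    """True once the variance has dropped to half of the threshold."""
    return usage_variance(disk_usage_pct) <= threshold_pct / 2


if __name__ == "__main__":
    disks = [70.0, 35.0, 50.0]  # percent used per disk
    print(should_start_rebalance(disks))  # gap of 35 points vs. 30% default
```

With the default 30% threshold, rebalancing in this sketch starts at a gap above 30 points and stops once the gap is 15 points or less.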

Reactive Rebalance

vSAN initiates a reactive rebalance automatically when a physical disk reaches an 80% full threshold. The reactive rebalance algorithm will try to place vSAN components on other disks in the cluster in an attempt to bring all disk capacity usage below the 80% threshold. When there are many physical disks in an environment, with lots of small components and lots of free capacity on the remaining disks in a cluster, this is fairly seamless. The disk selection algorithm will try to spread the new components as fairly as possible to keep all the disks uniformly filled. However, some environments may have very large VMDKs and only a small number of physical disks. In these cases, when a request is made to change the object policy dynamically, this may require a very large new VMDK to be instantiated. vSAN not only needs to instantiate this new VMDK, but if doing so pushes any physical disk over the 80% capacity threshold, it also starts to reactively rebalance the new components depending on the overall cluster state.
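The 80% trigger can be illustrated with a small Python sketch. It only models the threshold check, not the disk selection algorithm, and the function names and disk labels are made up for illustration.

```python
# Illustrative sketch only; names are hypothetical, not vSAN APIs.

REACTIVE_THRESHOLD_PCT = 80.0


def disks_over_threshold(disk_usage_pct):
    """Return the names of disks that have reached the 80% full threshold.

    disk_usage_pct maps a disk name to its percent-used value.
    """
    return sorted(
        name
        for name, used in disk_usage_pct.items()
        if used >= REACTIVE_THRESHOLD_PCT
    )


def placement_triggers_rebalance(capacity_gb, used_gb, new_component_gb):
    """True if placing a new component would push this disk to 80% or more."""
    projected_pct = (used_gb + new_component_gb) / capacity_gb * 100
    return projected_pct >= REACTIVE_THRESHOLD_PCT


if __name__ == "__main__":
    usage = {"disk-a": 82.0, "disk-b": 45.0, "disk-c": 80.0}
    print(disks_over_threshold(usage))
    # A large component landing on a 1000 GB disk that is already 70% used
    print(placement_triggers_rebalance(1000, 700, 150))
```

The second function shows why large VMDKs on a small number of disks are the awkward case: a single large component can take a disk from comfortably below the threshold to above it in one step.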

Even though we have a new Capacity Reserve in 7.0U1, there has been no change made to the 80% reactive rebalance threshold.

Capacity Reserve

There is no longer a need to set aside the 25-30% slack space on the vSAN datastore that we recommended previously. We now have the ability to control the amount of capacity that is reserved both for rebuild operations and for transient operations, such as the temporary capacity needed to make a policy change on an object. By default, the Capacity Reserve feature is disabled, meaning all vSAN capacity is available for workloads. However, once enabled, there are two advanced parameters that control how much of the vSAN datastore should be set aside for rebuild and for operations. The first parameter is Host Rebuild Reserve. This reservation is set to one host's worth of capacity. This means that if one host in the vSAN cluster fails and no longer contributes storage, there is still sufficient capacity remaining in the cluster to rebuild and re-protect all vSAN objects. This reservation is based on the N+1 host count recommendation. While the % value is high in small clusters (e.g. 25% in a 4-node cluster), it decreases significantly as a percentage of the overall cluster capacity as the number of hosts in the vSAN cluster increases (single digit % of capacity values for clusters > 12 nodes).
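The way the Host Rebuild Reserve shrinks as the cluster grows is simple arithmetic: one host out of N. A quick Python sketch follows, assuming every host contributes the same capacity. The function name is hypothetical, and the four-host minimum reflects the support restriction for Capacity Reserve.

```python
# Illustrative sketch only; assumes all hosts contribute equal capacity.


def host_rebuild_reserve_pct(num_hosts):
    """One host's worth of capacity as a percentage of the whole cluster."""
    if num_hosts < 4:
        raise ValueError("Capacity Reserve needs at least four hosts")
    return 100.0 / num_hosts


if __name__ == "__main__":
    for hosts in (4, 8, 13, 16):
        print(f"{hosts:>2} hosts: {host_rebuild_reserve_pct(hosts):.2f}% reserved")
```

For 4 hosts this gives 25%, matching the example above, and for 16 hosts it gives 6.25%.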

The second parameter is Operations Reserve. This is capacity that is set aside for internal, temporary vSAN operations. The example I use to explain this is a change in an object’s policy, where a new policy assigned to a vSAN object requires a new object with a completely different layout. In this case, vSAN instantiates the new object and synchronizes the contents with the old object, before finally discarding the original object with the old layout. Obviously this requires some additional space on the vSAN datastore while two copies of the object exist, and this is what the Operations Reserve is used for. This reserve can be set between 0% and 25%. To enable Capacity Reserve, navigate to Cluster > Configure > vSAN > Services > Enable Capacity Reserve. Here is an example of enabling Capacity Reserve on my 4-node vSAN cluster.

Once you edit the Enable Capacity Reserve setting, you will be shown how much of the total vSAN datastore capacity (by default) is allocated to each reserve, as shown below. This is also available in the vSAN Capacity View. You can then decide if you wish to enable the reservations or not. Note that if the used space on the vSAN datastore exceeds the suggested Operations threshold, vSAN might not operate properly. Similarly, if the used space on the vSAN datastore exceeds the suggested Host rebuild threshold, then vSAN may not be able to protect all of your vSAN objects in the case of a host failure.

You cannot simply enable Host rebuild reserve on its own when enabling Capacity Reserve. You must enable Operations reserve first, and then you can opt to enable Host rebuild reserve as well, or leave it disabled. Note, however, that the 10% overhead of the Operations threshold is accounted for before the Host rebuild reserve. Thus, in a small 4-node vSAN cluster such as mine, the 10% Operations Reserve is calculated first, and the Host rebuild threshold is then taken into account, as can be seen from the screenshots.

I hope this has given you a good understanding of the new control that you have available when deciding what reservations to set aside for rebuild and transient operations on vSAN. Do note, however, that the reserved capacity is not supported on a vSAN stretched cluster, or on a vSAN cluster with fault domains, including nested fault domains. It is also not available if you have a 2-node ROBO (Remote Office/Branch Office) vSAN cluster, or if the number of hosts in the vSAN cluster is less than four.
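These support restrictions can be summarized as a simple check. The sketch below is purely illustrative: the `ClusterInfo` type and its fields are hypothetical and do not correspond to any vSAN API.

```python
# Illustrative sketch only; ClusterInfo and its fields are hypothetical.
from dataclasses import dataclass


@dataclass
class ClusterInfo:
    num_hosts: int
    is_stretched: bool = False
    has_fault_domains: bool = False  # includes nested fault domains


def capacity_reserve_supported(cluster):
    """Apply the restrictions listed above.

    A 2-node ROBO cluster is covered by the four-host minimum.
    """
    return (
        cluster.num_hosts >= 4
        and not cluster.is_stretched
        and not cluster.has_fault_domains
    )


if __name__ == "__main__":
    print(capacity_reserve_supported(ClusterInfo(num_hosts=4)))
    print(capacity_reserve_supported(ClusterInfo(num_hosts=2)))
    print(capacity_reserve_supported(ClusterInfo(num_hosts=8, is_stretched=True)))
```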
