Improved block allocation mechanism on VMFS-6

Along with other improvements in VMFS-6, there is a new block allocation mechanism which aims to reduce lock contention between hosts sharing the same VMFS-6 filesystem. To understand how lock contention could arise, it is important to understand that resources on VMFS are grouped into resource clusters; this is the same for VMFS-6 and earlier versions of VMFS. Resources could be file descriptors, sub-blocks, file blocks, pointer blocks, and so on. Historically, we have always tried to allocate different resource clusters to different ESXi hosts, which meant that only VMs running on the same host shared resources within the same resource cluster. When there was a need to extend/grow VMDKs, it was all managed by the same host, avoiding any sort of lock contention for resources within the same resource cluster. However, if one of the VMs was vMotion’ed to another host, we could have VM A on host A and VM B on host B sharing resources within the same resource cluster. If host A tried to grow VM A at the same time as host B tried to grow VM B, we could get lock contention as they both try to consume the same resource and update its metadata (certain characteristics would need to be updated in the metadata to reflect the new size of the file). This is the situation that the new improvements to the block allocation mechanism address.

Let’s use a few diagrams to explain this situation a little better. Here we have a very simplified representation of a resource group. (L is short for Lock, M is short for Metadata, and R is short for Resource)

In the resource cluster, let’s assume that there are two thin provisioned VMs, A & B. Both reside on the same host, and both have already consumed a number of blocks/resources.

Now let’s assume that VM B gets migrated to another host (perhaps due to some DRS load balancing). Now VM A is on host A, and VM B is on host B. But the VMs continue to share the same resource cluster. At the same point in time, there is a requirement to grow both VMs A & B. Host A will try to consume the next resource (R9) at the same time as Host B tries to consume the same resource. They both race to lock L9 to modify the metadata M9 for resource R9, and thus contention arises.
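The race described above can be sketched in a few lines of Python. This is purely illustrative; the names (`grow_vmdk`, `next_free`, the per-resource lock list) are hypothetical stand-ins for the diagram's L/M/R elements, not VMFS internals.

```python
import threading

# Hypothetical model of one resource cluster: a lock (L) and metadata
# slot (M) per resource (R).
NUM_RESOURCES = 16
locks = [threading.Lock() for _ in range(NUM_RESOURCES)]
metadata = [None] * NUM_RESOURCES
next_free = 8  # R1..R8 already consumed, so R9 is the next candidate

def grow_vmdk(host, vm):
    # Both hosts compute the same "next free" resource (index 8, i.e. R9),
    # so they race to take lock L9 before updating metadata M9.
    idx = next_free
    if locks[idx].acquire(blocking=False):
        try:
            metadata[idx] = (host, vm)  # update M9 to record the new owner
            return f"{host}: allocated R{idx + 1} for {vm}"
        finally:
            locks[idx].release()
    return f"{host}: lock contention on L{idx + 1}"
```

On real VMFS the lock is an on-disk lock shared between ESXi hosts rather than an in-memory `threading.Lock`, but the shape of the race is the same: two allocators independently pick the same next resource and collide on its lock.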

Again, this is a very simplified representation, but hopefully you get the idea. Of course, in a worst-case scenario, there could be very many hosts contending for resources, not just 2. The bottom line is that in VMFS-5, resource clusters tend to have host affinity, which avoids contention until such time as a VM is migrated to another host. This can then lead to contention issues. In VMFS-6, we take a different approach. When a new file is created, we examine the existing resource clusters. If a resource cluster has any active users, meaning there is already a file using that resource cluster, it will not be picked for the allocation of resources for this new file. Instead, the block allocation logic now looks for inactive resource clusters that are not in use by any other files. Essentially, what we have done is moved from a host affinity model for resource clusters to a VM affinity model. This should mean that even in the event of a vMotion of a VM to a different host from where its files were created, VMFS-6 should not have multiple hosts contending for resources within the same resource cluster, since we are aiming to have each resource cluster used by only a single VM/file.
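The selection policy just described can be summarized in a short sketch. Note this is a simplified model under my own assumptions: `ResourceCluster` and its `active_users` field are invented for illustration and do not reflect the on-disk format.

```python
# Illustrative sketch of the VMFS-6 style selection policy: skip any
# resource cluster that already has a file using it. Names are hypothetical.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ResourceCluster:
    cluster_id: int
    active_users: int  # count of files currently using this cluster

def pick_cluster_for_new_file(clusters: List[ResourceCluster]) -> Optional[ResourceCluster]:
    # Prefer an inactive cluster, so the new file (and the host running
    # its VM) effectively gets the cluster's resources to itself.
    for rc in clusters:
        if rc.active_users == 0:
            return rc
    return None  # every cluster is active; a fallback is needed
```

Because each new file lands in a cluster of its own, a later vMotion moves the file's resource cluster along with its sole user, so no other host should be racing for the same locks.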

Of course, if all of the resource clusters become active/have files consuming resources, then we look for the resource cluster with the fewest users (we are using a simple reference count mechanism to track this). This should help keep lock contention issues to a minimum, if they arise at all, on VMFS-6.
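The fallback is essentially a minimum over the reference counts. A minimal sketch, assuming a hypothetical mapping from cluster id to its current reference count:

```python
def pick_least_used_cluster(ref_counts):
    # ref_counts: hypothetical dict mapping cluster id -> number of files
    # currently using that cluster. When no cluster is free, choose the
    # one with the fewest users to minimize the chance of lock contention.
    return min(ref_counts, key=ref_counts.get)
```

Fewer files per cluster means fewer hosts likely to be updating metadata in that cluster at the same time, which is the whole point of the new allocation scheme.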