I was involved in some conversations recently about how the VAAI UNMAP command behaves and which characteristics affect its performance. For those of you who do not know, UNMAP is our mechanism for reclaiming dead or stranded space from thinly provisioned VMFS volumes. Prior to this capability, the ESXi host had no way of informing the storage array that space previously consumed by a particular VM or file was no longer in use. This meant that the array thought that more space was being consumed than was actually the case. UNMAP, part of the vSphere APIs for Array Integration, enables administrators to overcome this challenge by telling the array that these blocks on a thin provisioned volume are no longer in use and that they can be reclaimed.
This conversation began with a brief chat where one of our partners noticed that increasing the block count per iteration passed to the reclaim command makes a huge difference in the time to reclaim. The command in question is esxcli storage vmfs unmap, with the -n option to specify the number of blocks. The number given as input is the number of VMFS blocks to reclaim per iteration. VMFS allocates blocks in multiples of 1MB, and on a VMFS5 volume the block size is 1MB, so a block count of 200 corresponds to 200MB per iteration.
We had a chat with some of our UNMAP experts. They agreed that the block count does indeed play a significant role in performance.
UNMAP works by allocating the specified number of VMFS blocks to a temporary unmap file in each iteration, issuing an UNMAP on all of those blocks, and then moving on to the next set of blocks until the entire volume is covered. So the more blocks you specify, the more blocks (and hence more physical space) can be covered per iteration, and therefore the less work there is for UNMAP. Conversely, if the block count is small, more UNMAP commands are required to release the space.
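To put some rough numbers on that, here is a minimal Python sketch of how the iteration count shrinks as the block count grows. It assumes a 1MB VMFS5 block size as described above; the function name and the 500GB figure are purely illustrative and are not part of any VMware tool.

```python
import math

VMFS_BLOCK_MB = 1  # VMFS5 block size, as stated in the post


def iterations_needed(reclaimable_mb: int, blocks_per_iteration: int) -> int:
    """Hypothetical helper: passes needed to cover reclaimable_mb of space
    when each pass handles blocks_per_iteration VMFS blocks."""
    if blocks_per_iteration <= 0:
        raise ValueError("blocks_per_iteration must be positive")
    mb_per_iteration = blocks_per_iteration * VMFS_BLOCK_MB
    return math.ceil(reclaimable_mb / mb_per_iteration)


if __name__ == "__main__":
    reclaimable_mb = 500 * 1024  # illustrative: 500GB of space to cover
    for block_count in (200, 1000, 5000):
        passes = iterations_needed(reclaimable_mb, block_count)
        print(f"block count {block_count}: {passes} iterations")
```

This only counts iterations. It says nothing about how long each one takes on a given array, which is something you would need to measure.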
Let’s cover that in some more detail. Each UNMAP command can specify a maximum of 100 segment descriptors, each of which contains a starting LBA and the number of contiguous 512-byte blocks starting from that LBA. For a given number of VMFS blocks per iteration (consider 200 for example), the best case is that all those VMFS blocks are contiguous on disk, resulting in a single UNMAP command with a single segment descriptor that covers the entire 200MB of contiguous space.
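The best case can be worked through in a few lines of Python. This is only a sketch: the SegmentDescriptor class and the helper are hypothetical names, and it assumes VMFS block 0 sits at LBA 0, ignoring partition offsets and on-disk metadata.

```python
from dataclasses import dataclass

SECTOR_BYTES = 512                 # LBA block size used by the descriptor
VMFS_BLOCK_BYTES = 1024 * 1024     # 1MB VMFS5 block
SECTORS_PER_VMFS_BLOCK = VMFS_BLOCK_BYTES // SECTOR_BYTES  # 2048


@dataclass(frozen=True)
class SegmentDescriptor:
    """Illustrative stand-in for one UNMAP segment descriptor."""
    start_lba: int
    num_sectors: int


def descriptor_for_run(first_vmfs_block: int, block_count: int) -> SegmentDescriptor:
    """Describe a contiguous run of VMFS blocks as a single descriptor.

    Assumption: VMFS block N starts at LBA N * 2048.
    """
    return SegmentDescriptor(
        start_lba=first_vmfs_block * SECTORS_PER_VMFS_BLOCK,
        num_sectors=block_count * SECTORS_PER_VMFS_BLOCK,
    )


if __name__ == "__main__":
    d = descriptor_for_run(first_vmfs_block=0, block_count=200)
    covered_mb = d.num_sectors * SECTOR_BYTES // (1024 * 1024)
    print(d, f"covers {covered_mb}MB")
```

So 200 contiguous 1MB blocks come out as one descriptor spanning 409,600 512-byte blocks.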
In the worst case scenario, and this is going into some specifics of VMFS, every alternate block out of a “cluster” of file blocks is allocated. This is the worst form of fragmentation, so each VMFS cluster results in a single UNMAP command with 100 segment descriptors, each of which covers a single isolated 1MB block. The number of UNMAP commands that would be generated in this case depends on the number of VMFS blocks given as input. So for example, if you specified 1000 blocks, and they are all fragmented, they would span at least 10 VMFS clusters and at most 1000 VMFS clusters, corresponding to 10 UNMAP commands with 100 segment descriptors each, or 1000 UNMAP commands with a single segment descriptor each. However, this is very much a worst case scenario and highly unlikely. The VMFS resource manager always tries to allocate a contiguous range of blocks to disk files (thin, thick, eagerzeroed thick), as discussed in this older but still informative blog post I wrote some time back on the vSphere blog.
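The arithmetic behind those two bounds is simple enough to sketch in Python. The function name is hypothetical; the only inputs are the 100-descriptor limit and the assumption that every block is isolated and so needs its own descriptor.

```python
import math

MAX_DESCRIPTORS_PER_UNMAP = 100  # limit per UNMAP command, as stated above


def fragmented_command_bounds(fragmented_blocks: int) -> tuple:
    """Return (fewest, most) UNMAP commands for fully fragmented blocks.

    fewest: every command carries the full 100 descriptors.
    most:   every command carries a single descriptor.
    """
    if fragmented_blocks < 0:
        raise ValueError("fragmented_blocks must not be negative")
    fewest = math.ceil(fragmented_blocks / MAX_DESCRIPTORS_PER_UNMAP)
    most = fragmented_blocks
    return fewest, most


if __name__ == "__main__":
    fewest, most = fragmented_command_bounds(1000)
    print(f"1000 fragmented blocks: between {fewest} and {most} UNMAP commands")
```

Compare that with the single command needed when the same blocks are contiguous.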
If there is little to no fragmentation, then the larger the number of blocks specified, the fewer UNMAP commands we have to generate in order to cover the entire reclaimable area. If the space being reclaimed is not contiguous, we may see additional UNMAP commands being issued, which will in all likelihood increase the time to reclaim the space.
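To see the effect of fragmentation on a concrete layout, the following Python sketch coalesces a set of VMFS block numbers into contiguous runs and counts how many commands would be needed at 100 descriptors per command. It is a deliberately simplified model: it ignores VMFS cluster boundaries and packs descriptors as tightly as possible, which is an assumption on my part rather than a description of what the VMFS code does.

```python
from typing import Iterable, List, Tuple

MAX_DESCRIPTORS_PER_UNMAP = 100  # limit per UNMAP command, as stated above


def coalesce(blocks: Iterable[int]) -> List[Tuple[int, int]]:
    """Merge block numbers into (first_block, length) contiguous runs."""
    runs: List[Tuple[int, int]] = []
    for block in sorted(set(blocks)):
        if runs and runs[-1][0] + runs[-1][1] == block:
            first, length = runs[-1]
            runs[-1] = (first, length + 1)  # extend the current run
        else:
            runs.append((block, 1))         # start a new run
    return runs


def commands_needed(blocks: Iterable[int]) -> int:
    """One descriptor per run, up to 100 descriptors per command."""
    descriptors = len(coalesce(blocks))
    return -(-descriptors // MAX_DESCRIPTORS_PER_UNMAP)  # ceiling division


if __name__ == "__main__":
    contiguous = range(200)          # 200 blocks in one run
    alternating = range(0, 400, 2)   # 200 blocks, every other one
    print("contiguous:", commands_needed(contiguous))
    print("alternating:", commands_needed(alternating))
```

The same 200 blocks need one descriptor when contiguous and 200 descriptors when every alternate block is allocated, which is why the layout on disk matters as much as the block count.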
Hopefully this goes some way towards explaining why you might observe different performance behaviour with various UNMAP commands, especially when larger block counts are used.