There have been some notable discussions about VMFS heap size and heap consumption over the past year or so. An issue with previous versions of VMFS heap meant that there were concerns when accessing above 30TB of open files from a single ESXi host. VMware released a number of patches as a temporary workaround; ESXi 5.0p5 & 5.1U1 introduced a larger heap size to deal with this. However, I’m glad to say that a permanent solution has been included in vSphere 5.5 in the form of a dedicated slab for VMFS pointer blocks and a new eviction process. I will discuss the details of this fix here.
What was the original issue?
In VMFS, indirect memory addressing uses a concept of a pointer block. Traditionally, the file descriptor on VMFS pointed to a pointer block which contained entries that pointed to the data blocks.
When we introduced support for a unified 1MB file block size with VMFS-5 in vSphere 5.0, we also introduced a double-indirect pointer block so that we could have very large VMDKs built with 1MB file blocks. Now the file descriptor pointed to a pointer block; the entries in this pointer block pointed to another pointer block, and this contained entries that pointed to the data blocks.
However, these pointer blocks were saved in an area called the PB Cache which was allocated from the VMFS heap. Unfortunately, we didn’t do a very good job of maintaining the PB Cache, which resulted in heap depletion and other undesirable side effects.
What have we done to fix it?
To fix this issue in vSphere 5.5, the PB Cache is now maintained outside of the VMFS heap, so heap depletion is no longer a concern. This also means that the default value of 256MB for VMFS heap should not need tuning or modifying going forward. I spoke with one of VMware’s core storage engineers, Bryan Branstetter, to get additional details. The PB Cache now has its own slab; this is a new concept – a slab differs from a heap in that it hands out memory chunks that are all the same size, and so it is not subject to the typical fragmentation issues that occur with a heap. Since the PB Cache is always allocating ~4KB chunks for the pointer blocks themselves, a slab is a much more efficient use of memory.
Bryan also told me that we have introduced a new eviction policy for the PB Cache, based on an LRU (Least Recently Used) algorithm. In vSphere 5.5, if the PB Cache size exceeds a certain threshold (which is tuneable), the eviction mechanism kicks in and starts removing the oldest pointer blocks.
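To make the slab-plus-LRU idea concrete, here is a minimal sketch of a fixed-size cache of same-sized "pointer block" slots with least-recently-used eviction. This is purely illustrative – the class name, the 4KB placeholder chunks, and the capacity are my own assumptions, not VMware code:

```python
from collections import OrderedDict

class PointerBlockCache:
    """Illustrative sketch: fixed-capacity slab of 4KB slots with LRU eviction."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # address -> cached pointer block

    def lookup(self, address):
        """Return the cached pointer block, faulting it in on a miss."""
        if address in self.blocks:
            self.blocks.move_to_end(address)   # mark as most recently used
            return self.blocks[address]
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)    # evict least recently used slot
        block = bytes(4096)                    # placeholder ~4KB chunk
        self.blocks[address] = block
        return block

cache = PointerBlockCache(capacity_blocks=2)
cache.lookup("pb1")
cache.lookup("pb2")
cache.lookup("pb1")   # touch pb1, so pb2 becomes least recently used
cache.lookup("pb3")   # capacity exceeded: pb2 is evicted
print(sorted(cache.blocks))  # ['pb1', 'pb3']
```

Because every slot is the same size, a freed slot can always be reused for the next pointer block, which is why a slab avoids the fragmentation a general-purpose heap suffers.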
The size of the VMFS heap is now irrelevant with respect to how much open addressable space there can be.
A new tuneable parameter, MaxAddressableSpaceTB, is now used to size the PB Cache slab appropriately. As mentioned, this is maintained separately from the VMFS heap.
The parameter defines how many pointer blocks for open files are cached in memory before the eviction mechanism kicks in. The default value is set to 32TB, but this can be set up to a maximum value of 128TB. Note however that bumping the default from 32 to 128 will have an impact on the amount of memory that is consumed by the PB Cache slab.
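If you do need to change the parameter, it should be adjustable like any other advanced system setting via esxcli. Note that the exact option path below is an assumption on my part (advanced VMFS settings live under the VMFS3 namespace), so verify it on your own 5.5 host before setting it:

```shell
# Check the current value (option path assumed; verify on your host first)
esxcli system settings advanced list -o /VMFS3/MaxAddressableSpaceTB

# Raise the cached address space from the 32TB default to 64TB
esxcli system settings advanced set -o /VMFS3/MaxAddressableSpaceTB -i 64
```

Remember the earlier caveat: raising this value grows the PB Cache slab and therefore consumes more host memory.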
The eviction process starts when the address space of open files reaches 80% of MaxAddressableSpaceTB. If the slab reaches 95% of capacity, pointer blocks will have to be evicted before new pointer blocks can be brought into the cache. It should be noted that MaxAddressableSpaceTB relates not to the size of the open file, but rather to the size of the working set for that open file. The following example should make this concept clearer.
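The two trigger points are simple arithmetic on the setting. A quick helper (my own illustration, not a VMware API – the 80% and 95% figures come from the description above):

```python
def pb_cache_thresholds(max_addressable_tb):
    """Return the two PB Cache eviction trigger points, in TB.

    Background LRU eviction starts at 80% of MaxAddressableSpaceTB;
    at 95% of slab capacity, eviction must occur before any new
    pointer block can be cached.
    """
    return {
        "background_eviction_tb": 0.80 * max_addressable_tb,
        "forced_eviction_tb": 0.95 * max_addressable_tb,
    }

# With the 32TB default: background eviction from ~25.6TB of cached
# addresses, forced eviction from ~30.4TB.
print(pb_cache_thresholds(32))
```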
Bryan discussed the example where a customer has 3 virtual machines with different VMDK sizes and different working sets. Note that the VMs can be on one or multiple different datastores – the PB Cache is a host-wide cache that is shared across all datastores that are being accessed from the same ESXi host.
The working set represents portions of each disk that are typically accessed once the system is in steady state:
- VM A with a 40TB disk, working set is ~14TB
- VM B with a 30TB disk, working set is ~20TB
- VM C with 3 x 20TB disks, working set is ~15TB
If the customer powers on all of these VMs on the same host, the total of open VMDK capacity is 130TB, but the combined working set is only 49TB. With the default MaxAddressableSpaceTB setting of 32TB, the customer may not get the performance they are looking for, because only 32TB of addresses will be cached for pointer block resolution. The system will thrash somewhat as it needs to evict the least recently used PB Cache entries before allowing new entries into the slab.
If the customer turns MaxAddressableSpaceTB up to 50TB, background eviction will kick in at 80% of 50TB (i.e. 40TB), which means there will be a performance improvement. There will still be some thrashing, but not as much as before. Turning it up to 60TB, background eviction kicks in at 48TB (80% of 60TB), which is pretty close to the working set size. To be completely sure that the most frequently accessed addresses are cached all the time, the working set should fit within 80% of the MaxAddressableSpaceTB setting. For this customer, setting it to 64TB should be adequate (but again, this will require more of the host’s memory).
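The sizing logic above can be reproduced with a few lines of arithmetic. The VM figures come straight from the example; the helper function and its name are just my own illustration:

```python
# Three VMs from the example above: provisioned capacity and
# steady-state working set, all in TB.
vms = {
    "VM A": {"disks_tb": 40, "working_set_tb": 14},
    "VM B": {"disks_tb": 30, "working_set_tb": 20},
    "VM C": {"disks_tb": 60, "working_set_tb": 15},  # 3 x 20TB disks
}

total_open_tb = sum(vm["disks_tb"] for vm in vms.values())
combined_working_set_tb = sum(vm["working_set_tb"] for vm in vms.values())
print(total_open_tb, combined_working_set_tb)  # 130 49

def min_max_addressable_tb(working_set_tb):
    """Smallest MaxAddressableSpaceTB whose 80% background-eviction
    threshold still covers the combined working set."""
    return working_set_tb / 0.80

# ~61.25TB, so rounding up to 64TB keeps the 49TB working set
# comfortably under the background-eviction threshold.
print(min_max_addressable_tb(combined_working_set_tb))
```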
As usual, the tradeoff here is the amount of memory consumed by the PB Cache vs. virtual machine performance. There is now no restriction on the amount of open files that a virtual machine can address; however, with large address spaces a lot of PB Cache evictions may occur, which will impact virtual machine performance.