VSAN 6.2 Part 1 – Deduplication and Compression

Now that VSAN 6.2 is officially launched, it is time to start discussing some of the new features that we have introduced into our latest version of Virtual SAN. Possibly one of the most eagerly anticipated features is the introduction of deduplication and compression, two space efficiency techniques that will reduce the overall storage consumption of the applications running in virtual machines on Virtual SAN. Of course, this also improves the economics of running an all-flash VSAN, and opens up all-flash VSAN to multiple use cases.

A brief overview of compression and deduplication on VSAN

Most readers are probably familiar with the concept of deduplication. It has been widely used in the storage industry for some time now. In a nutshell, deduplication checks to see whether a block of data is already persisted on storage. If it is, rather than storing the same block twice, a small reference is created to the already existing block. If the same block of data occurs many times, significant space savings are achieved.

In all-flash VSAN, which is where deduplication and compression are supported, data blocks are kept in the cache tier while they are active (hot) for optimal performance. As soon as the data is no longer active (cold), it is destaged to the capacity tier. It is during this destaging process that VSAN does the deduplication (and compression) processing.

[Image: dedupe/compress overview]

Deduplication on VSAN uses the SHA-1 hashing algorithm, creating a “fingerprint” for every data block. This hashing algorithm ensures that no two blocks of data result in the same hash, so that all blocks of data are uniquely hashed. When a new block arrives, it is hashed and then compared to the existing table of hashes. If the hash already exists, then there is no need to store the new block; VSAN simply adds a new reference to the existing one. If it does not already exist, a new hash entry is created and the block is persisted.
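To make the hash-then-lookup flow concrete, here is a minimal Python sketch of a fingerprint table of the kind described above. It is purely illustrative: the table layout, reference counting and names are my own assumptions for the example, not VSAN internals.

```python
import hashlib

# Illustrative dedupe lookup: map a SHA-1 fingerprint of a 4KB block to the
# physical location of the persisted block plus a reference count.
hash_table = {}      # fingerprint -> {"location": int, "refs": int}
next_location = 0

def write_block(block: bytes) -> int:
    """Return the physical location backing this block (illustrative only)."""
    global next_location
    fingerprint = hashlib.sha1(block).hexdigest()
    entry = hash_table.get(fingerprint)
    if entry is not None:
        # Block already persisted: just add another reference to it.
        entry["refs"] += 1
        return entry["location"]
    # New, unique block: create a hash entry and persist the data.
    location = next_location
    next_location += 1
    hash_table[fingerprint] = {"location": location, "refs": 1}
    return location
```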

Another new space-saving technique in VSAN 6.2 is compression. VSAN uses the LZ4 compression mechanism, and it works on 4KB blocks. If a new block is found to be unique, it also goes through compression. If LZ4 manages to reduce the size of the block to 2KB or less, the compressed version of the block is persisted to the capacity tier. If compression cannot reduce the size to 2KB or less, the full-sized block is persisted. We do it this way (deduplication followed by compression) because if the block already exists, we don’t have to pay the compression penalty for that block.
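The compress-or-store-full decision can be sketched in the same way. VSAN uses LZ4; the sketch below substitutes Python’s standard-library zlib purely so the example is self-contained, and the 4KB/2KB figures are taken from the description above.

```python
import zlib

BLOCK_SIZE = 4096          # VSAN compresses 4KB blocks
COMPRESSED_LIMIT = 2048    # persist the compressed form only if it fits in 2KB

def persist_unique_block(block: bytes) -> bytes:
    """Sketch of the compress-or-store-full decision for a unique block.

    VSAN uses LZ4; zlib is used here only to keep the example runnable
    with the standard library alone.
    """
    assert len(block) == BLOCK_SIZE
    compressed = zlib.compress(block)
    if len(compressed) <= COMPRESSED_LIMIT:
        return compressed    # destage the 2KB-or-smaller compressed block
    return block             # not compressible enough: destage the full 4KB block
```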

Is this inline or post process?

Many readers who follow online community debates will be well aware of past heated discussions about whether a deduplication process is done inline or post process. I’m sure we will see something similar around VSAN. For me, this method is neither inline nor post process. Since the block is not deduplicated or compressed in the cache tier, but rather as it is destaged to the capacity tier, we’ve decided to use the term “near-line” to describe this approach.

There is one major advantage with this approach: as your applications are writing data, the same block may be over-written multiple times in the cache tier. On all-flash VSAN, this block will be over-written numerous times in the high-performance, high-endurance cache tier/write buffer. Once the block is cold (no longer used), it is moved to the capacity tier. It is only at this point that it goes through the deduplication and compression processing. This is a significant saving on overhead, as cycles are not wasted deduplicating and compressing a block that is overwritten immediately afterwards, or multiple times afterwards.
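A toy model makes the saving obvious. In the sketch below (all names are illustrative, not VSAN internals), overwrites simply replace the hot copy in the write buffer, and only the block that is finally destaged pays the deduplication and compression cost.

```python
# Toy model of the near-line approach described above; names are illustrative.
write_buffer = {}    # logical block address (LBA) -> latest 4KB block contents

def cache_write(lba: int, block: bytes) -> None:
    # Repeated writes to the same LBA simply replace the hot copy in cache;
    # no dedupe or compression cycles are spent on the intermediate versions.
    write_buffer[lba] = block

def destage(lba: int, dedupe_and_compress) -> None:
    # Only when a block turns cold and leaves the cache tier does it go
    # through the dedupe + compression processing (e.g. the sketches above).
    block = write_buffer.pop(lba)
    dedupe_and_compress(block)
```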

Enabling deduplication and compression

This is very straightforward, as one would expect from VSAN. Simply navigate to the VSAN management view, edit the deduplication and compression option, and change it from disabled to enabled, as shown below:

[Image: enabling deduplication and compression]

Allow Reduced Redundancy

In the above screenshot, you might ask what the “Allow Reduced Redundancy” check-box is about. Well, enabling deduplication and compression requires an on-disk format change. If you have been through this process before, either upgrading from VMFS-L (v1) to VirstoFS (v2), or even upgrading to VSAN 6.2 (v3), you will be aware of the methodology used. VSAN evacuates data from the disk groups, removes them and then rebuilds them with the new on-disk format, one at a time. Then we go through a rinse-and-repeat for all disk groups. The same applies to deduplication and compression. VSAN will evacuate each disk group, remove it and recreate it with a new on-disk format that supports these new features.

Now, just like previous on-disk format upgrades, you may not have enough resources in the cluster to allow a disk group to be fully evacuated. Maybe this is a three-node cluster, and there is nowhere to evacuate the replica or witness while maintaining full protection. Or it could be a four-node cluster with RAID-5 objects already deployed; in this case, there is no place to move part of the RAID-5 stripe (since RAID-5 objects require 4 nodes). It might also simply mean that you have consumed a significant amount of disk capacity, with no room to store all the data on fewer disk groups. In any case, you still have the option of enabling deduplication and compression, but with the understanding that you will be running with some risk during the process, since there is no place to move the components. This option will allow the VMs to stay running, but they may not be able to tolerate the full complement of failures defined in the policy during the on-disk format change for dedupe/compression. With this option, VSAN removes components from the objects, rebuilds the disk group with the new on-disk format, and then rebuilds the components before moving on to the next disk group.

Monitoring deduplication and compression space-saving

VSAN 6.2 has introduced a new section to the UI called Capacity views. From here, administrators can not only see the overheads of filesystem on-disk formats and features such as deduplication and compression, but also see where capacity is being consumed on a per-object-type basis. I’ll write more about these new capacity views in another post. However, if you want to see the space savings that you are achieving with deduplication and compression, this is the place to look. The new UI will also display how much space would be required to inflate any deduplicated and compressed objects, should you decide to disable the space-saving features at some point in the future.

Deduplication and compression with Object Space Reservation

There is a major consideration if you already use object space reservation (OSR) and now wish to use deduplication and compression. In essence, objects either need to have 0% OSR (thin) or 100% OSR (fully reserved – thick). You cannot have any values in-between. If you already have VMs deployed with an OSR value that is somewhere in between, you will not be able to enable deduplication and compression. It will error as shown here:

[Image: OSR error when dedupe is enabled]

You need to remove that policy from the VMs if you wish to use deduplication and compression. Alternatively, set it to a value of 0% or 100%.
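To make the rule concrete, here is a minimal sketch of the kind of check involved. The policy structure and field name below are assumptions made purely for illustration, not the actual SPBM objects.

```python
# Hypothetical illustration of the 0%/100% OSR rule described above.
def can_enable_space_efficiency(policies) -> bool:
    """Return True only if every policy reserves either 0% or 100%."""
    for policy in policies:
        osr = policy.get("proportionalCapacity", 0)   # OSR as a percentage
        if osr not in (0, 100):
            print(f"Policy {policy['name']}: OSR of {osr}% is not supported "
                  "with deduplication and compression (must be 0% or 100%).")
            return False
    return True

# Example: a 50% reservation would block the operation.
print(can_enable_space_efficiency([{"name": "gold", "proportionalCapacity": 50}]))
```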

Other considerations

The first thing to note is that deduplication and compression are only available on all-flash VSAN configurations. The second thing to note is that this feature will make you reconsider the way you scale up your disk groups. In the past, administrators could add and remove disks to disk groups as the need arose. While you can still add individual disks to a disk group as you scale up, the advice is, if you plan to use deduplication and compression, to build fully populated disk groups in advance and then enable the space efficiency techniques on the cluster. Because the hash tables are spread out across all the capacity disks in a disk group when deduplication and compression are enabled, it is not possible to remove disks from a disk group after the space-saving features are enabled.

It should also be taken into consideration that a failure of any of the disks in the disk group will impact the whole of the disk group. You will have to wait for the data in the disk group to be rebuilt elsewhere in the cluster, address the failing component and recreate the disk group, followed by a re-balance of the cluster. These are some of the considerations you will need to take into account from an administrator's perspective when leveraging the space-saving capabilities of deduplication and compression on VSAN.

22 Replies to “VSAN 6.2 Part 1 – Deduplication and Compression”

  1. Cormac,
    Is there a technical reason why de-duplication and compression is only supported/available on All-Flash VSAN configurations?

  2. Hi Cormac,

    Can you help to clarify the following:

    1. What is the compression block size (i.e. does it try and compress a 32K block down to something made up of 2K blocks)?
    2. Where is the de-dupe hash/finger print database stored (i.e. RAM or SSD or both)?

    Many thanks as always
    Mark

    1. Hi Mark,

      1. Compression attempts to compress a 4KB block to 2KB or less.
      2. Dedupe data and metadata (hashes) are stored in a stripe across all disks in a disk group.

  3. Hi Cormac, I see that new views have been added to vCenter in order to manage all the new capabilities of VSAN 6.2. I was wondering if vROps will receive a new management pack at the release date in order to manage these new functionalities as well? While bugging you with this kind of question, will a new content pack for Log Insight be released at GA for the analytics part?

    Thanks!

    1. I don’t know about vROps, Christian, but I know that our engineering team is testing some new widgets for VSAN 6.2 for Log Insight.

  4. Hi Cormac!

    Granted, hash collisions are very unlikely. But this does not mean they cannot occur. No hashing algorithm can guarantee unique hashes for every block. Is there any check included that could detect a collision?

    Regards
    Wolfgang

    1. No – there is nothing that checks for a collision, but I believe most all-flash arrays don’t do this either, due to the performance implications.

      To be honest, the chances of a collision are ridiculously low, and on VSAN it is even less since the scope of dedupe is the disk group, and is not global across the whole of the cluster.

      1. @Cormac: “but I believe most all-flash arrays don’t do this either due to the performance implications”

        Just to name 2 exceptions, Kaminario’s K2 AFA and EMC’s VNX-F both do byte-for-byte comparisons on the de-dupe check to ensure that a collision didn’t happen and data isn’t lost as a result. Apparently EMC’s XIO and VMware’s VSAN don’t bother.

      2. “Deduplication on VSAN uses the SHA-1 hashing algorithm, creating a “fingerprint” for every data block. This hashing algorithm ensures that no two blocks of data result in the same hash, so that all blocks of data are uniquely hashed.”

        Your claim that the hashing algorithm ensures no two [different] blocks of data result in the same hash is inaccurate. It doesn’t ensure that at all. In fact, even assuming the digest is calculated on something as small as a 512-byte disk block (I assume you use a bigger frame, but 512B makes the point), the number of unique blocks is 2^(512*8), or 2^4096. The SHA-1 hash has a 160-bit output, so it has at most 2^160 unique outputs. Thus, on average, there are 2^4096/2^160 (= 2^3936) unique inputs for each possible unique output.

        2^3936 is approximated by 10^1184 (10 followed by 1183 zeros). This is far from ‘uniquely hashed.’

        What proponents of SHA-1 as a data index do claim is that SHA-1 makes it statistically unlikely that two unique frames have the same digest (hash). But this claim isn’t supported by the mathematics, because no one knows the actual randomness of SHA-1. At best, collisions are unlikely with a 1 in 2^80 probability, but this assumes perfect randomness, which again is completely unknown. It could be very much less, so much less that collisions aren’t unlikely enough to prevent false positive conclusions of input uniqueness and resulting data loss.

        Without knowing the randomness (technically the probability mass function) of SHA1 you can’t say it’s safe to index data with it. That’s why a byte-for-byte comparison is necessary to ensure storage safety. Else it’s just a lottery ticket with unknown odds.

  5. Hi Cormac,

    With All-Flash VSAN performance is no longer an issue for most customers, but capacity becomes the problem.

    I have been looking through the partner portal and whitepapers to see if I could find something that would help me accurately size usable capacity after de-duplication and compression.

    I remember Duncan mentioning something about 3:1, but is there anything a bit more scientific (i.e. a tool) that will allow us to come up with a more accurate figure, taking into account the customer's actual data profile?

    If it is not available now, is it planned?

    Many thanks as always
    Mark

  6. Hi Cormac – any chance you could provide a response to the above question on de-dupe/compression ratio sizing? Many thanks as always, Mark

    1. We don’t have any such tooling, I’m afraid. We can only display dedupe/compression statistics after the feature is enabled and working.

  7. Hi Cormac,

    What are the recommendations around sizing? Clearly for All-Flash VSAN the challenge is not performance, it is the cost per effective TB. If we go with an overall figure of 3:1 it is a no-brainer compared to Hybrid, but at 2:1 or less it might be looking a little expensive.

    In order to come up with an accurate estimate we need to have a profile of the customers data and apply a different reduction ratio to each data type to come up with an overall blended ratio. A customer with mostly media files is probably going to get close to 1:1, whereas a VDI customer will probably get greater than 5:1.

    How are VMware SEs sizing all-flash VSAN?

    Many thanks as always
    Mark

      1. Not that I can find – I have also looked in the “VMware Virtual SAN 6.2 Space Efficiency Technologies” paper and nothing in there either – unless I am missing something!!!

        Just to note many of the storage array vendors have the same problem – no real tools or guidelines on how to size effective capacity.
