This is something I only learnt about very recently. It seems that we have made a major improvement to the way we do snapshot consolidation in vSphere 6.0. Many of you will be aware that when a VM is very busy, snapshot consolidation may need to go through multiple iterations before the consolidation/roll-up operation can complete successfully. In fact, there are situations where the consolidation operation can fail outright if there is too much I/O.
Previously, we used a helper snapshot and redirected all new I/O to it while we consolidated the original chain. Once the original chain was consolidated, we then calculated how long it would take to consolidate the helper snapshot, which may have grown considerably during the consolidate operation. If the time to consolidate the helper was within an acceptable time-frame (12 seconds), we stunned the VM and consolidated the helper snapshot into the base disk. If it was outside that time-frame, we repeated the process (a new helper snapshot while we consolidated the original helper) until the helper could be committed to the base disk within the acceptable time-frame.
We retried this process for a defined number of iterations, but there have been situations where, due to the amount of I/O in flight, we could never successfully consolidate the snapshot chain and its helpers.
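To see why a busy VM could defeat this scheme, here is a minimal sketch of the iterative loop described above. This is not VMware's actual code: the function, the commit rate, and the iteration limit are all hypothetical, and only the 12-second stun threshold comes from the text.

```python
# Simplified model of pre-6.0 iterative helper-snapshot consolidation.
# All names and rates are hypothetical illustrations, not ESXi internals.

STUN_THRESHOLD_SECS = 12   # acceptable stun time to commit the helper (from the text)
MAX_ITERATIONS = 10        # hypothetical retry limit before giving up
COMMIT_RATE_MB_S = 100     # assumed rate at which helper data can be committed

def consolidate(helper_size_mb, write_rate_mb_s):
    """Return the iteration on which the helper can be committed within the
    stun threshold, or None if the guest writes faster than we can commit."""
    for iteration in range(1, MAX_ITERATIONS + 1):
        commit_time = helper_size_mb / COMMIT_RATE_MB_S
        if commit_time <= STUN_THRESHOLD_SECS:
            # Stun the VM briefly and commit this helper into the base disk.
            return iteration
        # Otherwise: open a new helper for incoming writes and consolidate the
        # current one in the background. While that runs, the guest writes
        # commit_time * write_rate_mb_s of new data into the new helper.
        helper_size_mb = commit_time * write_rate_mb_s
    return None  # too much I/O in flight: consolidation fails

print(consolidate(10_000, 10))    # quiet VM: helper shrinks each pass -> 2
print(consolidate(10_000, 200))   # write rate exceeds commit rate -> None
```

The key point the sketch makes is that convergence depends entirely on the guest's write rate: if each helper grows faster than the previous one can be committed, the loop never terminates, which is exactly the failure case mentioned above.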
This is very reminiscent of the Storage vMotion mechanism we had in vSphere 4.0, which used a feature called Changed Block Tracking (CBT). I wrote about this on the vSphere Storage blog some time back. CBT keeps track of which disk blocks have changed after the initial copy. We then went through one or more iterative copy passes until the number of changed blocks was small enough to allow us to switch the running VM to the destination datastore using the Fast Suspend/Resume operation.
In vSphere 5.0, we improved Storage vMotion by completing the operation in a single pass rather than multiple iterative copy passes. Storage vMotion in vSphere 5.0 uses a new Mirror Driver mechanism to keep blocks on the destination synchronized with any changes made to the source after the initial copy. The migrate process makes a single pass over the disk, copying all the blocks to the destination disk. If any block changes after it has been copied, the mirror driver synchronizes it from the source to the destination. There is no longer any need for iterative passes, which means the migration completes in a single pass and the Storage vMotion operation is much shorter.
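The single-pass idea can be illustrated with a toy model. This is an assumption-laden sketch, not ESXi code: it models a disk as a list of blocks, and any guest write that lands on an already-copied block is mirrored to the destination so it never goes stale.

```python
# Toy model of the mirror-driver approach: one sequential copy pass, with
# writes to already-copied blocks applied to both source and destination.
# Class and method names are hypothetical illustrations, not VMware APIs.

class MirrorDriver:
    def __init__(self, source):
        self.source = source              # list of block values
        self.dest = [None] * len(source)
        self.copied = 0                   # blocks [0, copied) are already copied

    def guest_write(self, block, value):
        """Intercept a guest write arriving during the migration."""
        self.source[block] = value
        if block < self.copied:
            # Already copied: mirror the write to the destination too, so the
            # copy pass never has to revisit this block.
            self.dest[block] = value
        # Not yet copied: the copy pass will pick up the new value later.

    def copy_pass(self, writes):
        """Single pass over the disk. `writes` maps a copy position to guest
        writes that arrive just after that block is copied (simulated I/O)."""
        for block in range(len(self.source)):
            self.dest[block] = self.source[block]
            self.copied = block + 1
            for b, v in writes.get(block, []):
                self.guest_write(b, v)
        return self.dest == self.source   # destination is fully in sync

md = MirrorDriver([0] * 8)
# After block 2 is copied, the guest writes block 1 (mirrored immediately)
# and block 5 (picked up later in the same pass).
print(md.copy_pass({2: [(1, 99), (5, 42)]}))   # True: in sync after one pass
```

The design point is that synchronization happens inline with each write, so by the time the single pass reaches the end of the disk there is no backlog of changed blocks left to re-copy.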
In vSphere 6.0, the snapshot consolidation process also uses the mirror driver. With this mechanism, changes to the VM are written to both the active VMDK and the base disk (while preserving write order) during consolidation. Snapshot consolidation should now complete in a single pass (with minimal, or indeed no, helper disks), with a dramatically shorter stun time and a much smaller chance of consolidation failure.
This is a very desirable improvement to the snapshot mechanism.