Virtual Volumes – A new way of doing snapshots

Cormac

9 years ago

I learnt something interesting about Virtual Volumes (VVols) last week. It relates to the way in which snapshots have been implemented in VVols. Historically, VM snapshots have left a lot to be desired. So much so that the GSS best practices for VM snapshots, as per KB article 1025279, recommend having only 2-3 snapshots in a chain (even though the maximum is 32) and using no single snapshot for more than 24-72 hours. VVols mitigate these restrictions significantly, not just because snapshots can be offloaded to the array, but also because of the way consolidate and revert operations are implemented.

Let’s start with how things work at the moment with redo log format snapshots. When we snapshot a base disk, a child delta disk is created. The parent is then considered a point-in-time (PIT) copy. The running point of the virtual machine is now the delta. New writes by the VM go to the delta, while reads of blocks that have not changed since the snapshot are still satisfied by the base disk. It would look something like this:
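To make the read and write paths concrete, here is a minimal toy sketch in Python. It is only an illustration of the idea, not how the VMDK redo log format is actually laid out on disk: each disk is modelled as a dictionary of block number to data, and the helper names (`take_snapshot`, `write_block`, `read_block`) are my own inventions.

```python
# Toy model (an assumption for illustration): a disk is a dict of
# block number -> data. A chain is a list whose first element is the
# base disk and whose last element is the VM's running point.

def take_snapshot(chain):
    """Freeze the current running point and add an empty delta on top."""
    chain.append({})

def write_block(chain, block, data):
    """New writes always land in the newest delta (the running point)."""
    chain[-1][block] = data

def read_block(chain, block):
    """Search from the newest delta back towards the base disk."""
    for disk in reversed(chain):
        if block in disk:
            return disk[block]
    return None  # block was never written

chain = [{0: "A", 1: "B"}]     # base disk only, no snapshots yet
take_snapshot(chain)           # the base becomes the PIT copy
write_block(chain, 1, "B2")    # lands in delta-1, base is untouched
```

In this sketch, reading block 0 falls through to the base disk, while reading block 1 is satisfied by the delta. The longer the chain, the more disks a read may have to walk through.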

Now when a consolidate operation is needed, in other words when we wish to roll up the changes from delta-2 into delta-1 and then into the base, we need to redo all of the changes held in the deltas against the base disk, and change the running point of the VM back to the base VMDK. You could do this by consolidating each snapshot, one at a time. There is of course another way to do it, which is to consolidate the whole chain (delete-all). With a very long chain, it takes considerable effort to redo all of the changes from each snapshot into the base disk, since each delta’s set of changes needs to be committed in turn. Not only that, but when a whole chain is consolidated and the base disk is thinly provisioned, the base may require additional space as snapshot changes are merged into it.
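The cost of a redo log consolidate can be sketched with the same kind of toy model. Again, this is a simplified illustration under my own assumptions (disks as dictionaries, a hypothetical `consolidate_all` helper), not the actual vSphere implementation. The point is that the work grows with the amount of data held in the deltas.

```python
# Toy model (an assumption for illustration): chain[0] is the base disk,
# chain[1:] are the snapshot deltas, oldest first.

def consolidate_all(chain):
    """Commit every delta into the base; return the number of blocks copied."""
    base = chain[0]
    copied = 0
    for delta in chain[1:]:    # oldest first, so newer data wins
        base.update(delta)     # redo this delta's changes into the base
        copied += len(delta)
    del chain[1:]              # running point returns to the base disk
    return copied

chain = [{0: "A", 1: "B"}, {1: "B2"}, {1: "B3", 2: "C"}]
blocks_before = len(chain[0])
copied = consolidate_all(chain)
```

Here three blocks have to be copied to fold two small deltas into the base, and the base ends up with one more block than it started with, which is the thin provisioning growth mentioned above.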

One final item to highlight with the redo log format is the revert mechanism. This is quite straightforward with redo logs, as we can simply discard the chain of deltas and return to a particular delta or base disk. In this example, we reverted to the base disk by simply discarding the snapshot deltas holding the changes made since we took the snapshots:
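In toy form, a redo log revert is little more than truncating the chain. This sketch assumes the same simplified model as the description above (disks as dictionaries, a hypothetical `revert_to` helper) and ignores everything else a real revert has to take care of.

```python
# Toy model (an assumption for illustration): chain[0] is the base disk,
# chain[1:] are the snapshot deltas, oldest first.

def revert_to(chain, index):
    """Return to the PIT copy at chain[index] by discarding newer deltas."""
    discarded = len(chain) - (index + 1)
    del chain[index + 1:]      # no block data is copied anywhere
    return discarded

chain = [{0: "A", 1: "B"}, {1: "B2"}, {1: "B3", 2: "C"}]
discarded = revert_to(chain, 0)   # revert all the way to the base disk
```

No data is merged or copied here. The deltas are thrown away and the base disk is exactly as it was when the first snapshot was taken.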

Now that we have a grasp of the basic concepts of the snapshot redo log format, let’s turn our attention to the new VVol format. The first thing to remember is that with a VVol snapshot, you are always running off the base disk. The VM no longer has its running point on a snapshot delta. The delta is responsible for maintaining a point-in-time (PIT) copy of the data, which means that as the VM does I/O to the base disk, the delta is responsible for tracking the original data blocks. It would look something similar to the following:
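Conceptually, this behaves like an undo log. The following is a minimal Python sketch of that idea, assuming a simple copy-on-first-write scheme. How an array actually preserves the original blocks is up to the storage vendor, so treat the names (`take_snapshot`, `write_block`, `MISSING`) and the mechanism as illustrative assumptions only.

```python
# Toy model (an assumption for illustration): the base disk is a dict of
# block number -> data, and each snapshot is a dict holding the ORIGINAL
# contents of blocks overwritten since that snapshot was taken.

MISSING = object()  # marks a block that did not exist at snapshot time

def take_snapshot(snapshots):
    """Start a new, empty PIT copy; the VM keeps running on the base."""
    snapshots.append({})

def write_block(base, snapshots, block, data):
    """Write to the base, preserving the original block on first overwrite."""
    if snapshots and block not in snapshots[-1]:
        snapshots[-1][block] = base.get(block, MISSING)
    base[block] = data

base = {0: "A", 1: "B"}
snapshots = []
take_snapshot(snapshots)
write_block(base, snapshots, 1, "B2")
write_block(base, snapshots, 1, "B3")   # original "B" is already preserved
```

After these writes the base disk holds the latest data, and the snapshot holds only the one original block needed to reconstruct the point-in-time view.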

Now things become a whole lot more interesting when it comes to consolidate and revert operations. A consolidate operation no longer means that every child snapshot in the chain has to be read and merged into its parent. Instead, a consolidate operation on VVol snapshots simply means discarding the snapshot chain, as we already have the latest and greatest information in the base disk.
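In the same toy terms, a consolidate under this scheme touches no data blocks at all. This is a sketch under my own assumptions (snapshots modelled as dictionaries of preserved original blocks, a hypothetical `consolidate_all` helper), not a description of any particular array.

```python
# Toy model (an assumption for illustration): the base holds current data,
# and each snapshot holds original blocks preserved since it was taken.

def consolidate_all(base, snapshots):
    """Discard every snapshot; the base disk is left exactly as it is."""
    discarded = len(snapshots)
    snapshots.clear()
    return discarded

base = {0: "A", 1: "B3"}
snapshots = [{1: "B"}, {1: "B2"}]
base_before = dict(base)
discarded = consolidate_all(base, snapshots)
```

The base disk is identical before and after, so the cost no longer depends on how much data changed while the snapshots existed.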

Finally, let’s look at a revert operation on a VVol snapshot. This entails going back to a particular point in time in the snapshot chain. In this case, we can consider this an undo operation as opposed to a redo operation – we must undo the changes in the base disk using the original blocks stored in a delta/point-in-time copy. This may look similar to the following:
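The undo can be sketched as follows. As before, this is a toy model built on my own assumptions (snapshots as dictionaries of preserved original blocks, a `MISSING` marker for blocks that did not exist at snapshot time, and a hypothetical `revert_to` helper), and real arrays may do this very differently.

```python
# Toy model (an assumption for illustration): the base holds current data,
# and snapshots[i] holds the original contents of blocks overwritten
# between snapshot i and snapshot i+1 (or now, for the newest one).

MISSING = object()  # marks a block that did not exist at snapshot time

def revert_to(base, snapshots, index):
    """Undo changes, newest snapshot first, back to snapshot `index`."""
    restored = 0
    for snap in reversed(snapshots[index:]):
        for block, original in snap.items():
            if original is MISSING:
                base.pop(block, None)     # block did not exist back then
            else:
                base[block] = original    # put the original data back
            restored += 1
    del snapshots[index + 1:]             # newer PIT copies no longer apply
    snapshots[index].clear()              # base now matches this snapshot
    return restored

base = {0: "A", 1: "B3", 2: "C"}
snapshots = [{1: "B"}, {1: "B2", 2: MISSING}]
restored = revert_to(base, snapshots, 0)
```

Unlike the redo log revert, blocks have to be written back into the base disk here, which is why this operation can be expected to cost more.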

We think that this behaviour is going to lead to major improvements in virtual machine snapshot performance. Since these VVol snapshots will also be offloaded to the array, there should be no restrictions, and customers can utilize the full 32 snapshots-in-a-chain limit from vSphere. We refer to these as “managed snapshots” – of course the array itself can support many more than this. This enhancement will also mean that consolidate operations on managed snapshots (which are the most common operation – think backups, etc.) will be pretty much instantaneous. Admittedly, the revert operation may be slower than a revert operation on a redo log based snapshot, but this is most likely not a very common operation when compared to consolidate.

Note: I haven’t covered snapshots with memory, nor the effect of using VSS – Microsoft’s Volume Shadow Copy Service – on applications running in the guest OS. I also didn’t cover how vSphere can leverage “unmanaged snapshots” on the array. These will be topics for future posts.