This is a question that seems to come up regularly, but I don’t think it appears in any great detail in external facing documentation. The question is “when do we stun (or in other words, quiesce) virtual machines”, why do we do it, and more importantly, how long can a stun operation take? One of our staff engineers, Jesse Pool, put together some really good explanations around the VM stun operation, which I am leveraging for this post. I took some particular interest in this as I wrote a bunch of snapshot posts recently around Virtual Volumes (VVols) so I think this fits in quite nicely. A “stun” operation means we pause the execution of the VM at an instruction boundary and allow in-flight disk I/Os to complete. The stun operation itself is not normally expensive (typically a few 100 milliseconds, but it could be longer if there is any sort of delay elsewhere in the I/O stack).
Recently I published an article on Virtual Volumes (VVols) where I touched on a comparison between how migrations typically worked with VAAI and how they now work with VVols. In the meantime, I managed to have some really interesting discussions with some of our VVol leads, and I thought it worth sharing here as I haven’t seen this level of detail anywhere else. This is rather a long discussion, as there are a lot of different permutations of migrations that can take place. There are also different states that the virtual machine could be in. We’re solely focused on VVols here, so although different scenarios are offered up, I highlight what scenario we are actually considering.
Let’s begin this post with a recap of the Storage vMotion enhancements made in vSphere 5.0. Storage vMotion in vSphere 5.0 enabled the migration of virtual machines with snapshots and also the migration of linked clones. It also introduced a new mirroring architecture which mirrors the changed disk blocks after they have been copied to the destination, i.e. we fork a write to both source and destination using mirror mode. This means migrations can be done in a single copy operation. Mirroring I/O between the source and the destination disks has significant gains when compared to the iterative disk pre-copy changed block tracking (CBT) mechanism in the previous version & means more predictable (and shorter) migration time.