When and why do we "stun" a virtual machine?

This is a question that seems to come up regularly, but I don’t think it is covered in any great detail in external-facing documentation. When do we “stun” (or in other words, quiesce) virtual machines, why do we do it, and more importantly, how long can a stun operation take? One of our staff engineers, Jesse Pool, put together some really good explanations around the VM stun operation, which I am leveraging for this post. I took a particular interest in this as I wrote a bunch of snapshot posts recently around Virtual Volumes (VVols), so I think this fits in quite nicely.

A “stun” operation means we pause the execution of the VM at an instruction boundary and allow in-flight disk I/Os to complete. The stun operation itself is not normally expensive (typically a few hundred milliseconds, but it could be longer if there is any sort of delay elsewhere in the I/O stack).

Once a VM is “stunned” (quiesced), we can do some interesting things to the VM. A simple example is suspending the VM. To implement a suspend operation, we first “stun” the VM, and then “serialize” the guest’s memory state to a file on disk. We can later “resume” from the suspended state by reading the suspend state back into memory.
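To make the stun, serialize and resume sequence a little more concrete, here is a minimal Python sketch. It is only a toy model: the `ToyVM` class, its fields and the use of `pickle` are all assumptions made for illustration, and none of it reflects how ESXi is actually implemented.

```python
import pickle


class ToyVM:
    """Hypothetical stand-in for a VM; not a VMware API."""

    def __init__(self):
        self.memory = {}         # guest memory state
        self.inflight_ios = []   # I/Os issued but not yet completed
        self.running = True

    def stun(self):
        # Pause execution, then let outstanding I/Os drain.
        self.running = False
        while self.inflight_ios:
            self.inflight_ios.pop(0)   # pretend this I/O completes

    def suspend(self):
        # Suspend = stun first, then serialize the memory state.
        self.stun()
        return pickle.dumps(self.memory)

    @classmethod
    def resume(cls, blob):
        # Resume = read the suspended state back into memory.
        vm = cls()
        vm.memory = pickle.loads(blob)
        return vm
```

In this model, the time spent in `stun()` grows with the number of outstanding I/Os, which is the same reason a slow I/O stack can lengthen a real stun.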

We also use the “stunned” state to create snapshots, delete snapshots, or switch over from one disk copy to another during a live storage migration aka Storage vMotion.

As a rule, anything that involves complex disk changes while the VM is running will usually require stunning the VM. This is because we usually need to close the virtual disks (VMDKs), and that can only be done if we first quiesce I/Os.

In the past we have seen long stun times during snapshot creation and snapshot deletion. Snapshot creation can cause a long stun time if there are many configured disks, or if basic file system operations (like opening files) are taking a long time (e.g. the datastore is overloaded). Snapshot deletion has caused long stun times in the past because the guest was writing to the disk faster than we could combine redo-logs; this issue has since been addressed with a new way of doing consolidation which does not involve collapsing the whole chain, but rather consolidating a child into its parent, one delta at a time.
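The idea of consolidating a child into its parent, one delta at a time, can be sketched as follows. This is a toy model under stated assumptions: each disk in the chain is a plain dictionary mapping block numbers to contents, and the helper names `read_block` and `consolidate_one` are hypothetical. Real redo-logs and VMDKs are of course far more involved.

```python
def read_block(chain, block):
    # The newest delta wins: search from the top of the chain down.
    for layer in reversed(chain):
        if block in layer:
            return layer[block]
    return None


def consolidate_one(chain, index):
    """Merge the delta at `index` into its parent at `index - 1`."""
    child = chain.pop(index)
    chain[index - 1].update(child)   # child blocks overwrite the parent's


# Base disk, one snapshot delta, and the running delta on top.
chain = [{0: "base0", 1: "base1"}, {1: "snap1"}, {0: "run0"}]
```

Calling `consolidate_one(chain, 1)` folds the snapshot delta into the base disk while the running delta at the top of the chain is left alone, so what the guest reads through `read_block` does not change.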

Let’s look at the scenarios in more detail one at a time, and discuss the use of “stun” in each case:

Create a snapshot

To create a VM snapshot, the VM is “stunned” in order to (i) serialize device state to disk, and (ii) close the current running disk and create a snapshot point. The process is the same for both Windows and Linux. However, with VSS integration for Windows guests, taking the snapshot may involve more file system operations, and so it may take longer to complete a “stun” operation in these guests.

Revert a VM to a previous snapshot

A “stun” operation is not required to revert a snapshot.

Delete a snapshot/Delete all snapshots

A distinction needs to be made here: are we deleting a snapshot in the chain, or are we consolidating the disk chain? Deleting a snapshot in the chain never requires stunning the VM, but deleting all snapshots (i.e. a consolidate operation) always requires stunning the VM.

When consolidating, the VM is “stunned” in order to close the disks and put them in a state that is appropriate for consolidation.

vMotion

A vMotion operation first copies the guest memory state from one host to another host without stunning the VM. Once most of the guest memory is transferred, the source side of the vMotion is stunned and the device state is serialized over the network to the destination (just like the suspend operation discussed previously, but over the network instead of to disk). The destination resumes from the transferred state, and the source powers off. We try to keep the stun time here to a maximum of 1 second.
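Here is a rough sketch of the shape of that flow: copy memory while the guest runs, and only stun for the last small remainder. The function name, the `threshold` parameter and the way dirtied pages are fed in as a scripted list are all assumptions for illustration. The iterative re-copy of dirtied pages is a generic pre-copy technique, not a description of vMotion internals.

```python
def precopy_migrate(source_pages, dirty_rounds, threshold=2):
    """Toy pre-copy migration.

    source_pages: dict of page number -> contents on the source host.
    dirty_rounds: list of dicts; the pages the guest rewrites during
                  each copy pass (a hypothetical, scripted workload).
    threshold:    stun once this many pages or fewer remain.
    """
    dest = {}
    to_copy = set(source_pages)
    rounds = iter(dirty_rounds)
    passes = 0

    # Copy passes run while the guest is still executing.
    while len(to_copy) > threshold:
        for page in to_copy:
            dest[page] = source_pages[page]
        passes += 1
        dirtied = next(rounds, {})      # guest wrote these meanwhile
        source_pages.update(dirtied)
        to_copy = set(dirtied)

    # Stun: the guest is paused, so nothing new can be dirtied
    # while the final remainder is transferred.
    for page in to_copy:
        dest[page] = source_pages[page]
    return dest, passes, len(to_copy)
```

The last value returned is the number of pages copied while stunned. Keeping that remainder small is what keeps the stun window short in this model.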

Storage vMotion

If we are migrating the VM home directory then we use a type of local-host vMotion (self-vMotion).

Migrating virtual disks involves placing a “mirror” driver (a type of filter) on top of the I/O stack. Guest I/Os will be duplicated to both the original virtual disk as well as to a new copy by the mirror driver. The new copy is the target of the migration.

In order to install the mirror we need to close the disks, and in order to close the disks we need to “stun” the VM. Similarly, after the Storage vMotion is complete, we need to remove the mirror driver and so another “stun” is required.
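The mirror approach, with its two stuns, can be modelled in a few lines of Python. Again, this is just a sketch under assumptions: disks are dictionaries of block number to contents, and `MirrorDriver` and `storage_migrate` are hypothetical names, not real vSphere components or APIs.

```python
class MirrorDriver:
    """Hypothetical filter that duplicates guest writes to two disks."""

    def __init__(self, original, copy):
        self.original = original
        self.copy = copy

    def write(self, block, data):
        # Every guest write lands on both the original and the new copy.
        self.original[block] = data
        self.copy[block] = data


def storage_migrate(original, guest_writes):
    target = {}
    stuns = 1                              # stun 1: close disks, install mirror
    mirror = MirrorDriver(original, target)
    writes = iter(guest_writes)

    for block in list(original):           # background bulk copy
        target[block] = original[block]
        nxt = next(writes, None)           # the guest keeps running meanwhile
        if nxt is not None:
            mirror.write(*nxt)

    for block, data in writes:             # writes arriving after the bulk copy
        mirror.write(block, data)

    stuns += 1                             # stun 2: remove mirror, switch disks
    return target, stuns
```

Because every guest write goes to both disks once the mirror is in place, the target ends up identical to the original no matter how the writes interleave with the bulk copy, and the VM is only stunned at the two ends of the operation.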

Hope this helps to answer some of your questions around “stun” operations. Kudos once again to Jesse for the details.

Published by Cormac
