New mclock I/O Scheduler in vSphere 5.5 – Some details

My colleague Duncan recently wrote a post about the new mclock I/O scheduler which VMware introduced in vSphere 5.5. He also mentioned some caveats with the new scheduler, especially around the I/O size (32K) used with the IOPS setting, which may lead to some unexpected behaviour. As Duncan mentioned, the reason for introducing the new scheduler is primarily to provide a better I/O scheduling mechanism that allows for limits, shares and reservations. Unfortunately, we didn’t do a very good job of announcing this change in I/O scheduling, or documenting the behaviour, and it has led to a number of additional questions from our customers. I hope to address some of these here.

Q1. Why was the I/O scheduler changed in vSphere 5.5?
A1. As mentioned, the new mclock I/O scheduler provides additional controls, such as reservations.
Q2. Why does the I/O scheduler IOPS value use 32K as an I/O size?
A2. As Duncan mentions in his post, a 4K I/O request is not the same as a 64K I/O request. There is a different cost associated with the service time of each I/O. When you think about disk I/O, there are two major factors in the service time:
  1. Setting the disk head to the request offset
  2. Transferring the data from that offset onwards
For small I/O requests, positioning the head (1) is the dominating factor. For large requests, it is the transfer (2) that dominates. An I/O scheduler needs a way to account for these differences when enforcing shares, limits, and reservations. Hence, you pick a request size that balances the costs of (1) and (2) as the scheduler’s normalization factor. This is why we picked 32K as the I/O request size; it seems an acceptable I/O size for disks.
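To make the normalization concrete, here is a minimal Python sketch of how a request might be charged against an IOPS setting. It assumes that a request is counted as one I/O per 32K chunk, rounded up, and that any request of 32K or less counts as a single I/O. The function name normalized_io_count and the round-up behaviour are my own assumptions for illustration, not the scheduler's actual code.

```python
import math

# Default normalization size described in this post (32K).
CHUNK_BYTES = 32 * 1024


def normalized_io_count(io_size_bytes: int, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Return how many I/Os one request is charged as.

    Assumption: the request is split into chunk_bytes pieces and
    rounded up, so anything up to one chunk costs exactly one I/O.
    """
    if io_size_bytes <= 0:
        raise ValueError("io_size_bytes must be positive")
    return max(1, math.ceil(io_size_bytes / chunk_bytes))


if __name__ == "__main__":
    for size_kb in (4, 32, 64, 256):
        cost = normalized_io_count(size_kb * 1024)
        print(f"{size_kb}K request is charged as {cost} I/O(s)")
```

Under these assumptions a 4K and a 32K request each cost one I/O, while a 64K request costs two and a 256K request costs eight, which is why a VM issuing large I/Os can hit its IOPS limit sooner than the raw request count suggests.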
Q3. Are IOPS limits also considered for cloning and snapshot tasks?
A3. Yes.
Q4. How are IOPS limits handled with VAAI (vSphere APIs for Array Integration) turned on? Are the offloaded tasks (e.g. cloning) considered?
A4. Yes. I/Os handled by VAAI are accounted for in a similar way to normal non-VAAI I/Os. Therefore the calculation of IOPS in VAAI mode is also similar to that in non-VAAI mode.
Q5. Do the throughput/bandwidth controls (MB/s limit) behave like the IOPS limit in terms of a Storage vMotion operation, cloning etc.? Is this documented somewhere?
A5. The mclock scheduler in vSphere 5.5 doesn’t support bandwidth controls. See the vSphere 5.5U1 Release Notes for further information. In addition, there is a caveat with the mclock scheduler whereby the bandwidth and throughput limits are violated when both are configured for a SCSI virtual disk in the configuration file of a virtual machine. This caveat is documented in KB article 2059192. If you wish to use this functionality, you will need to revert to an earlier version of the scheduler as per the KB.
Q6. Why is Storage vMotion limited for powered-on VMs if they have an IOPS limit set (e.g. 160 IOPS) but if the VM is powered-off, and then cold migrated, the operation takes all the IOPS it can get (e.g. 300 IOPS)?
A6. This is intentional behavior and was done to respect the IOPS limit set for a given VM. Storage vMotion is designed to inherit the I/O limits of whatever disk is being moved, or whatever VM is being moved. In a nutshell, we do not allow the datamover to violate whatever limits the customer has imposed on a given VM’s I/O.
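As a rough illustration of what that limit means for a migration of a powered-on VM, the Python sketch below estimates the throughput ceiling implied by an IOPS limit. It assumes the 32K round-up accounting described in A2, and the 64K request size is purely hypothetical; I am not stating the I/O size the datamover actually uses.

```python
import math

CHUNK_BYTES = 32 * 1024  # normalization size from A2


def max_throughput_mb_per_s(iops_limit: int, io_size_bytes: int,
                            chunk_bytes: int = CHUNK_BYTES) -> float:
    """Estimate the MB/s ceiling implied by an IOPS limit.

    Assumption: each request costs ceil(size / chunk) I/Os against the
    limit, with a minimum cost of one I/O.
    """
    cost_per_request = max(1, math.ceil(io_size_bytes / chunk_bytes))
    requests_per_second = iops_limit / cost_per_request
    return requests_per_second * io_size_bytes / (1024 * 1024)


if __name__ == "__main__":
    # 160 is the example limit from Q6; 64K is a hypothetical request size.
    print(max_throughput_mb_per_s(160, 64 * 1024))  # 5.0
    print(max_throughput_mb_per_s(160, 4 * 1024))   # 0.625
```

With a limit of 160 IOPS, large requests top out at around 5 MB/s under these assumptions, so a Storage vMotion of a powered-on VM with a low limit can take considerably longer than the same migration done cold.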
  1. This may seem like a silly question, but I don’t see a reservation setting available for a vdisk on a HW10 VM on an ESXi 5.5U2 host. What am I missing?

    (Yes Disk.SchedulerWithReservation=1)

    • Not a silly question – remember what I said about us not documenting this very well.

      Anyhow, there are some teasers in our API docs –

      Note the ‘Reservation’ entry – Reservation control is used to provide guaranteed allocation in terms of IOPS. Large IO sizes are considered as multiple IOs using a chunk size of 32 KB as default. This control is initially supported only at host level for local datastores. In future, it may get supported on shared storage based on integration with Storage IO Control. Also right now we don’t do any admission control based on IO reservation values.

      One of our clever guys wrote a script to call this, and what we saw was sched.scsi0:0.reservation = “10” appear in the .vmx file of the VM (10 representing the IOPS reservation). Now one assumes you could add this in to get your IOPS reservation, but I haven’t been able to find any customer facing docs that talk about it in earnest, other than the one listed above.

      Also note the statement “initially supported only at host level for local datastores”.

      • Thanks for the clues Cormac. Perhaps my confusion has stemmed from the fact that I read the original paper by Ajay Gulati which only spoke to mclock in relation to SAN storage.
