Automating the IOPS setting in the Round Robin PSP

A number of you have reached out about how to change some of the settings around path policies, in particular how to set the default number of IOPS in the Round Robin path selection policy (PSP) to 1. Many of you have written scripts to do this, but when you reboot the ESXi host, the defaults of the PSP are re-applied and you have to run the scripts again to reapply the changes. Here I will show you how to modify the defaults so that when you unclaim/reclaim the devices, or indeed reboot the host, the desired settings come into effect immediately.

Before we begin, let's just revisit this whole IOPS=1 setting in the Round Robin path selection policy. My pal Duncan has posted his concerns about this before. Just to recap, the Round Robin PSP works best at scale. If you do some testing with a single VM on a single datastore, then you are going to see improvements with the IOPS=1 setting when compared to the default setting of 1000. However, once you start to scale (multiple VMs deployed across multiple datastores), the default value provided by VMware should provide just as good performance. Now I'm not going to question how our storage partners are arriving at the recommendation of IOPS=1, but I hope they are testing the path selection policy at scale before making it. Let's move on.

Each storage array supported on the VMware Hardware Compatibility List (HCL) has a Storage Array Type Plugin (SATP) associated with it. This decides which conditions will fail an I/O over to an alternate path (i.e. which SCSI sense codes). Each SATP has a default Path Selection Policy (PSP), which determines which path to use for I/O. One of these PSPs is Round Robin, a path selection policy which balances I/O across all active paths. When to switch to the next path is decided by certain criteria. By default, the PSP will send 1,000 I/Os down the first path, then 1,000 I/Os down the next, and so on in a round robin fashion. Many of our storage partners are recommending that this should be set to a single I/O before moving to the next path. HP recommend it here on page 29 and EMC recommend it here on page 18.
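
If you want to see what a host is doing today before changing anything, the following commands show the default PSP associated with each SATP, and the current Round Robin settings on a device that is already claimed by Round Robin (naa.xxxx is simply a placeholder for one of your own device identifiers):

# esxcli storage nmp satp list
# esxcli storage nmp psp roundrobin deviceconfig get -d naa.xxxx

The second command reports the I/O operation limit for the device, which will be 1000 unless it has already been changed.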

The command to associate the Round Robin PSP with a modified IOPS setting is the following esxcli command:

# esxcli storage nmp satp rule add -s "TestSATP" -V "TestVendor" -M "TestModel" -P "VMW_PSP_RR" -O "iops=1"
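
To check that the rule has been accepted, you can list the SATP rules and search for the vendor string you used; rules added in this way show up in the "user" rule group:

# esxcli storage nmp satp rule list | grep -i TestVendor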

In order to identify the “Vendor” and “Model” variables, you will need to do the following:

1. Perform a rescan on an ESX/ESXi host
2. Perform a "grep -i scsiscan /var/log/vmkernel.log"
3. In there, you will see the Vendor and Model for the device. Some examples are below:

0:00:00:29.585 cpu2:4114)ScsiScan: 1059: Path 'vmhba36:C0:T0:L0': Vendor: 'Dell' Model: 'MD32xxi' Rev: '7320'

2013-06-19T23:55:11.838Z cpu18:8728)ScsiScan: 888: Path 'vmhba1:C0:T0:L0': Vendor: 'DGC ' Model: 'RAID 5 ' Rev: '0430'
2013-06-19T23:55:11.838Z cpu18:8728)ScsiScan: 891: Path 'vmhba1:C0:T0:L0': Type: 0x0, ANSI rev: 4, TPGS: 3 (implicit and explicit)

You can see from the above examples that we have a Vendor “Dell” and a Model “MD32xxi” in the top line and Vendor “DGC” and Model “RAID 5” in the second.
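
If the devices have already been claimed by the host, the same Vendor and Model strings can also be read from the device properties instead of the logs; for example (again, naa.xxxx is a placeholder for one of your device identifiers):

# esxcli storage core device list -d naa.xxxx

The Vendor, Model and Revision fields in this output should match what the vmkernel log reported at scan time.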

If the above esxcli command is run with these Vendor and Model values, any device discovered by the ESXi host from this array type will not only have the Round Robin PSP automatically associated with it, but will also have the IOPS value set to 1. Note that if the Vendor ID contains trailing spaces, as in the case of the EMC DGC model above, the trailing spaces must be included. If there are any claim options associated with the device (as is the case with TPGS for ALUA arrays), these must also be included.
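
Purely as an illustration, a rule for the DGC example above, where the array is ALUA capable and claimed with the tpgs_on claim option, might look something like this. The SATP name and claim option here are assumptions on my part, so verify them against the existing rules on your own host (the rule list command shown earlier) before using anything like it:

# esxcli storage nmp satp rule add -s "VMW_SATP_ALUA_CX" -V "DGC" -M "RAID 5" -c "tpgs_on" -P "VMW_PSP_RR" -O "iops=1"

Whether the Vendor and Model strings need their trailing padding included is, as noted above, something to confirm for your particular array and firmware.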

In order to get the claim rule to load, it would be far easier to reboot the host; otherwise you would have to unclaim each device already claimed and then reload the claim rules. However, if you are not in a position to reboot, these are the steps to unclaim the devices and reload the claim rules:

# esxcli storage core claiming unclaim -t device -d naa.xxxx

Repeat the above command for all LUNs presented (a scripted approach for hosts with many LUNs is sketched below). Once that is done, run this load command followed by a rescan:

# esxcli storage core claimrule load
# esxcfg-rescan vmhbaX
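
If there are a lot of LUNs presented, the unclaim step can be scripted. This is only a rough sketch: it walks the NMP device list and attempts to unclaim each naa device, and it will fail (harmlessly) for any device that is currently in use, which is expected:

# for DEV in $(esxcli storage nmp device list | grep '^naa.') ; do esxcli storage core claiming unclaim -t device -d ${DEV} ; done

Then run the claimrule load and rescan commands shown above.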

One caveat to this: if the customer is using Microsoft Cluster (MSCS) RDMs, they will need to manually change those LUNs to Fixed or MRU, as the RDM LUNs will also be claimed by Round Robin, which is not supported for MSCS in the currently released versions of ESX/ESXi.
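
For those MSCS RDM LUNs, the PSP can be switched back on a per-device basis. A minimal sketch, with naa.xxxx again being a placeholder for the RDM device identifier:

# esxcli storage nmp device set -d naa.xxxx --psp=VMW_PSP_FIXED

Use VMW_PSP_MRU instead if that is the policy appropriate for your array.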

Hope you find this useful.

20 Replies to “Automating the IOPS setting in the Round Robin PSP”

  1. I would agree that iops=1 does not make much sense in an environment with a significant number of VMs and datastores. But on the other hand one usually sees spikes from single VMs being I/O bound for a certain (short) amount of time.
    Wouldn't it be worth considering setting the IOPS parameter equal to the actual queue depth of a path?

    1. Hi Martin,

      I suppose if there was only a single LUN down that path, then IOPs = device queue depth would make sense. However, typically there are lots of devices visible down each path, so at that point you would be limiting yourself imo.

      1. Cormac – I think you are doing a lot of assuming on the validity / goodness of the default – more thought/effort went into the vendor recommendations than you suggest. “However, once you start to scale (multiple VMs deployed across multiple datastores), then the default value provided by VMware should provide just as good performance. ” <– that's a pretty sizable and unsupported assumption.

        I can name at least 2 arrays where this default value is VERY significantly problematic, and changing it helps performance significantly (at scale). Both Symmetrix storage and 3PAR benefit from this very significantly, for example. There are more besides that.

        Both of these arrays improve, and both were well tested to prove the improvement.

        1. Matt, when vendors don’t publish their scaling numbers (hosts, datastores, VMs) when making these recommendations, then yes, I can only make assumptions.

          Can you share any reference architecture details about the configurations used to arrive at the IOPS=1 value? What were the issues seen when the IOPs value was set to the default?

  2. Thank you for this, it’s very helpful. I found that with EqualLogic iSCSI devices, IOPS=3 works very well. The only time I’ve seen the settings revert was in an early build of ESXi v5.0.

    This is the script we have been using.
    esxcli storage nmp satp set --default-psp=VMW_PSP_RR --satp=VMW_SATP_EQL ; for i in `esxcli storage nmp device list | grep EQLOGIC|awk '{print $7}'|sed 's/(//g'|sed 's/)//g'` ; do esxcli storage nmp device set -d $i --psp=VMW_PSP_RR ; esxcli storage nmp psp roundrobin deviceconfig set -d $i -I 3 -t iops ; done

    Yours has the benefit of set once and forget it. This one has to be re-run as new LUNs are added.

    Moving off of FIXED or RR with IOPs = 1000 has resolved "latency alerts" in ESXi v5.x for many of our customers, where ESX is reporting latency problems but the SAN doesn't show any signs of latency issues.

  3. So if we wanted to do this via powercli as more of an audit process to enforce the default claim rule using the $esxcli.storage.nmp.satp.rule.add what would the structure be like?

  4. Hi Cormac!

    Thanks for considering. 🙂 The reason I put it as a question is that I’m unsure which queues are affected by the path switch, or the actual queue depth of a *path*. From this http://blogs.vmware.com/vsphere/2012/07/troubleshooting-storage-performance-in-vsphere-part-5-storage-queues.html it seems as if the device (LUN) queue was the last buffer, but from this http://virtualgeek.typepad.com/virtual_geek/2009/06/vmware-io-queues-micro-bursting-and-multipathing.html it should be the HBA buffer (which would make more sense to me). In the latter case and with a default HBA queue depth of 1024 it would explain why the default of 1000 IOPS fits, wouldn’t it?
    But the PSP is individual for each LUN, so – when does a path change kick in, which queues are affected? If it’s after the LUN queue, I would use an IOPS value equal to the LUN queue depth (maybe half of it, have to think about that…). This should make better use of multiple paths than the default, no matter how many LUNs are visible. Even if these are so many that the average load will be roughly the same on all physical links, I/O bursts to single LUNs should benefit, or did I miss something?

    Interesting topic you put up (again), thanks. 🙂

    In the end I assume(!) that the storage system will more likely be the limiting factor anyway…

    1. Good questions Martin. The answer will depend on the nature of the failover. Is it going to fail over to an alternate path to the LUN whilst remaining on the same HBA, or is it going to fail over to an alternate path to the LUN on a different HBA? The amount of I/O that can be queued to a device is limited by the device/LUN queue depth, but then there is the global maximum for the HBA, which limits how many devices can operate at that queue depth. In other words, the HBA queue depth must be greater than or equal to the device queue depth * the number of LUNs presented to that HBA.

  5. Hi Cormac,
    yeah, these numbers I know. And the issue of for example two HBAs with two ports offering four paths (on the ESXi side, obviously the total will be higher) complicates the whole thing even more, so I left that out for now. Or SIOC and adaptive queueing. 🙂
    Bottom line: too complex for non-trivial environments (aka production), so personally I prefer to stick to the default – keeping it simple.
    But I’d still love to learn how and at which point the path failover affects the queues. 🙂 Anybody willing to explain?

  6. Hi Cormac,

    I followed the steps you suggested, but I still can't manage to get new devices to pick up the default configuration of iops=1. I did a full reboot of the ESX host but things are the same.
    Any suggestions? Anyone actually tried it and got it to work? I know this is not very informational, but I didn't want to leave an overfilled comment.

    Thanks,

  7. What storage are you using? Getting the vendor ID and model number correct is essential.

    Please run:
    # esxcli storage nmp device list
    and provide a small subset of that output.

    Also, are you saying only new volumes don't get the IOPs value but your previous ones do?

    1. Hi Don,

      The vendor ID and model name I got by issuing the command:
      # grep -i scsiscan /var/log/vmkernel.log
      had trailing spaces in them. As stated in the blog post, I included all the spaces in the esxcli command; unfortunately it didn't work.

      After testing some variations of the command, eventually, I found out that omitting the spaces was the solution.

      I hope that someone will find this helpful.

  8. Hi Cormac and everybody else

    I too do not see the idea in switching paths for every I/O, or at least not when we're talking about e.g. iSCSI.
    But I do see the idea in tweaking it to fit with the frame size, because do we really want to switch paths after every I/O, or do we want to switch paths when the frame size is reached, so as to saturate the entirety of our bandwidth back to the SAN?
    And I'm not the only one with this thought; Dave Gibbons has given his view on a usable tweak here: http://blog.dave.vc/2011/07/esx-iscsi-round-robin-mpio-multipath-io.html
    Sadly enough I haven't had the opportunity to test any of these solutions, so I'd very much like to hear what you guys think?

    Best regards, Daniel…

  9. I agree that setting the IOPS to 1 is not needed. But whether you set it to 3 or by frame size, the end result is the same. Better use of available paths.

    Customers who have made the change have all reported much better results over the default.

    Re: Saturate. In a huge majority of cases, bandwidth is rarely saturated. Especially with 10GbE. Even more so in a hypervisor environment like ESX. The ability to rapidly process IO requests is more important than MB/sec. High throughput requires larger blocks and highly sustained IO. Not something you normally see with ESX. Many VMs doing small(ish) random IOs are more the norm.

    So getting as many IOs going over multiple paths yields better results.

    For EQL & ESX, there is a Best Practices guide available here:

    http://en.community.dell.com/techcenter/extras/m/white_papers/20434601.aspx

    It includes setting the IOPs to 3.

    Regards,

    Don

  10. With this setting in place, if we manually change a LUN to Fixed, will it revert back to RR after a reboot?
