vSphere 6.0 Storage Features Part 6: action_OnRetryErrors
In vSphere 6.0, we have improved how we handle I/O failures caused by issues such as flaky drivers, misbehaving firmware, dropped frames, fabric disruption, dodgy array firmware, and so on. Previously, we would continually retry these sorts of I/O errors, which could lead to all sorts of additional problems. In this release we are changing our behaviour for marking a path dead.
To help address flaky I/O problems and mark a path as dead when they occur, there is a new tuneable option introduced in vSphere 6.0 called action_OnRetryErrors. By default it is disabled. It can be seen in the output of the esxcli storage nmp device list command:
Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on;alua_followover=on; action_OnRetryErrors=off; {TPG_id=0,TPG_state=ANO}{TPG_id=1,TPG_state=AO}}
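To check the setting for a single device, the field can be pulled out of that line. The following is only a minimal sketch: the helper name retry_errors_state is made up for illustration, and the sample text is the output line shown above rather than live data from a host.

```shell
# Hypothetical helper: read device config text on stdin and print the
# value of action_OnRetryErrors (for example "on" or "off").
retry_errors_state() {
    sed -n 's/.*action_OnRetryErrors=\([a-z]*\).*/\1/p'
}

# Sample line copied from the output above. On a host you might instead pipe in:
#   esxcli storage nmp device list -d naa.XXX
sample='Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on;alua_followover=on; action_OnRetryErrors=off; {TPG_id=0,TPG_state=ANO}{TPG_id=1,TPG_state=AO}}'

state=$(printf '%s\n' "$sample" | retry_errors_state)
echo "action_OnRetryErrors is $state"   # prints: action_OnRetryErrors is off
```

If the option does not appear in the input at all, the helper prints nothing, so an empty result is worth treating as "not reported" rather than "off".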
The new option will allow a Storage Array Type Plugin (SATP) in 6.0 to mark a path “dead” and fail over to another path. Previously we would repeatedly try to send I/O to the same path when it failed with retry-able errors. In many cases, it could be that the path is bad or busy but the device is fully functional. By changing the path state to dead, a failover will be triggered and an alternate working path to the device will be used. Of course, if the problem is with the device (it is overloaded or slow) then this failover to an alternate path will not help. However, we are no worse off by failing over to a new path since the I/O is failing anyway; it is something worth trying.
The option can be set on an SATP using the following command:
# esxcli storage nmp satp generic deviceconfig set -c enable_action_OnRetryErrors -d naa.XXX
The option can be reset using the following command:
# esxcli storage nmp satp generic deviceconfig set -c disable_action_OnRetryErrors -d naa.XXX
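When several devices need the same change, it can help to generate the commands first and review them before running anything. The sketch below does only that: it prints the commands and never executes them. The function name print_retry_action_cmds is hypothetical, and naa.XXX and naa.YYY are placeholders, not real device identifiers.

```shell
# Hypothetical helper: print (do not run) the deviceconfig command for each device.
# $1 is "enable" or "disable"; the remaining arguments are device identifiers.
print_retry_action_cmds() {
    action=$1
    shift
    case "$action" in
        enable|disable) ;;
        *) echo "usage: print_retry_action_cmds enable|disable DEVICE..." >&2
           return 1 ;;
    esac
    for dev in "$@"; do
        echo "esxcli storage nmp satp generic deviceconfig set -c ${action}_action_OnRetryErrors -d $dev"
    done
}

# Placeholder device names, as in the examples above.
print_retry_action_cmds enable naa.XXX naa.YYY
```

Printing rather than executing is a deliberate choice here, since the option changes failover behaviour and is worth reviewing per device.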
If you want the option to persist across reboots (added as part of an SATP claim rule), then the -o|--option can be used, as per the example below:
# esxcli storage nmp satp rule add -t device -d naa.XXX -s VMW_SATP_EXAMPLE -P VMW_PSP_FIXED -o enable_action_OnRetryErrors
This is an excellent enhancement to the PSA, Pluggable Storage Architecture. If you wish to learn more about the PSA, you can start with this series of articles.
Hi Cormac, looking forward to this feature!
One question though: when that dead path becomes stable again (after the problem has been fixed), will it automatically be brought back into production?
In other words, when a path is marked as dead, will vSphere make regular retries on it, for instance trying it again every X minutes?
Grtz, Peter
Yes – paths are tested every 300 seconds iirc. I believe it is also tunable.
OK, thanks for the positive response. This feature will certainly help in making the vSphere RR MPIO mechanism more robust overall. One last question: will we be able to activate action_OnRetryErrors by default on a specific SATP (for example with a new SATP claim rule)? Or is it only configurable per device, as in your example?
Grtz, Peter
Sorry for the late response. The answer is I’m not sure as I haven’t tested it, but one assumes it is possible.
Hey Cormac,
Would you use this in a vMSC environment?
Pete
Not until it has been tested Pete, and I’m not sure anyone has yet.