CormacHogan.com

vSphere 6.0 Storage Features Part 6: action_OnRetryErrors

In vSphere 6.0, an improvement has been made to how we handle I/O issues. Flaky drivers, misbehaving firmware, dropped frames, fabric disruption, dodgy array firmware, and so on can all cause I/O failures. The issue is that, previously, we continually retried these sorts of I/O errors, which can lead to all sorts of additional problems. In this release we are changing our behaviour for marking a path dead.

To help address flaky I/O problems and mark a path as dead when they occur, there is a new tuneable option introduced in vSphere 6.0 called action_OnRetryErrors. By default it is disabled. It can be seen in the output of the esxcli storage nmp device list command:

Storage Array Type Device Config: {implicit_support=on;
explicit_support=off; explicit_allow=on;alua_followover=on; 
action_OnRetryErrors=off; 
{TPG_id=0,TPG_state=ANO}{TPG_id=1,TPG_state=AO}}
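When checking many devices, the flag can be pulled out of that output with standard text tools rather than read by eye. The following is a minimal sketch, assuming the output format shown above. The function name retry_errors_state is made up for illustration, and it reads text on stdin so it can also be tried on saved output.

```shell
# Hypothetical helper (not part of esxcli): print the action_OnRetryErrors
# value(s) found in `esxcli storage nmp device list` output read from stdin.
retry_errors_state() {
    # Match "action_OnRetryErrors=<value>" and print only <value>.
    sed -n 's/.*action_OnRetryErrors=\([A-Za-z]*\).*/\1/p'
}

# Usage on an ESXi host (naa.XXX is a placeholder device ID):
#   esxcli storage nmp device list -d naa.XXX | retry_errors_state
```

One value is printed per matching line, so running it against the full device list gives one line per device that reports the option.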

The new option allows a Storage Array Type Plugin (SATP) in 6.0 to mark a path “dead” and fail over to another path. Previously we would repeatedly try to send I/O to the same path when it failed with retry-able errors. In many cases, it could be that the path is bad or busy but the device is fully functional. By changing the path state to dead, a failover will be triggered and an alternate working path to the device will be used. Of course, if the problem is with the device (it is overloaded or slow) then this failover to an alternate path will not help. However, we are no worse off by failing over to a new path since the I/O is failing anyway; it is something worth trying.

The option can be set on an SATP using the following command:

# esxcli storage nmp satp generic deviceconfig set -c \
enable_action_OnRetryErrors -d naa.XXX

The option can be reset using the following command:

# esxcli storage nmp satp generic deviceconfig set -c \
disable_action_OnRetryErrors -d naa.XXX
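If the option needs to be changed on several devices, the two commands above can be generated in a loop. This is only a sketch: the function name retry_errors_cmds and the device IDs are placeholders, and the function prints the commands instead of running them so the list can be reviewed first.

```shell
# Hypothetical helper: print the deviceconfig command for each device ID.
# $1 is "enable" or "disable"; the remaining arguments are device IDs.
retry_errors_cmds() {
    action=$1
    shift
    case $action in
        enable|disable) ;;
        *) echo "usage: retry_errors_cmds enable|disable DEVICE..." >&2
           return 1 ;;
    esac
    for dev in "$@"; do
        echo "esxcli storage nmp satp generic deviceconfig set -c ${action}_action_OnRetryErrors -d $dev"
    done
}

# Example (naa.XXX and naa.YYY are placeholders):
#   retry_errors_cmds enable naa.XXX naa.YYY        # review the commands
#   retry_errors_cmds enable naa.XXX naa.YYY | sh   # then run them
```

Printing first and piping to sh second keeps the change deliberate, which seems sensible for a setting that can mark paths dead.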

If you want the option to persist across reboots (added as part of a SATP claim rule), then the -o|--option flag can be used, as per the example below:

# esxcli storage nmp satp rule add -t device -d naa.XXX -s \
VMW_SATP_EXAMPLE -P VMW_PSP_FIXED -o enable_action_OnRetryErrors

This is an excellent enhancement to the Pluggable Storage Architecture (PSA). If you wish to learn more about the PSA, you can start with this series of articles.
