
vSphere 5.1 Storage Enhancements – Part 4: All Paths Down (APD)

All Paths Down (APD) is a situation which occurs when a storage device is removed from the ESXi host in an uncontrolled manner, either due to administrative error or device failure. Over the past several vSphere releases, VMware has made significant improvements to the handling of the APD (All Paths Down) condition. This is a difficult condition to manage since we don't know if the device is gone forever or if it might come back, i.e. is it a permanent device loss or is it a transient condition.

The bigger issue around APD is what it can do to hostd. Hostd worker threads will sit waiting indefinitely for I/O to return (for instance, when a rescan of the SAN is initiated from the vSphere UI). However, hostd has only a finite number of worker threads, so if these all get tied up waiting on disk I/O, then other hostd tasks will be affected. A common symptom of APD is ESXi hosts disconnecting from vCenter because their hostd daemons have become wedged.

I wrote about the vSphere 5.0 enhancements that were made to APD handling on the vSphere Storage blog here if you want to check back on them. Basically, a new condition was introduced in vSphere 5.0. This condition is known as PDL (Permanent Device Loss), and this is where we knew that the device was never coming back. We learnt this through SCSI sense codes sent by the target array; the best-known example is ILLEGAL REQUEST / LOGICAL UNIT NOT SUPPORTED (sense data 0x5 0x25 0x0). Once we had a PDL condition, we could fast-fail I/Os to the missing device and prevent hostd getting tied up.

Following on from the 5.0 APD handling improvements, vSphere 5.1 sets out to address the remaining APD pain points, described in the sections below.

It should be noted that in vSphere 5.0 U1, we fixed an issue so that a PDL is correctly detected and vSphere HA can restart the affected VMs on other hosts in the cluster which are not in this APD/PDL state. This enhancement is also in 5.1.
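As an aside, the way to opt in to this behavior in the 5.0 U1 / 5.1 timeframe was through a pair of advanced settings. The snippet below is only a sketch based on how these were documented at the time (a per-host file plus an HA cluster advanced option); verify the exact names and values against the documentation for your release.

    # On each ESXi host, add to /etc/vmware/settings:
    disk.terminateVMOnPDLDefault = "TRUE"

    # As a vSphere HA cluster advanced option:
    das.maskCleanShutdownEnabled = "true"

The first setting allows a VM to be killed when its storage goes into PDL; the second allows vSphere HA to treat that killed VM as a restart candidate on another host.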

Complex APD
As I have already mentioned, All Paths Down affects more than just Virtual Machine I/O. It can also affect hostd worker threads, leading to host disconnects from vCenter in worst-case scenarios, and it can affect vmx I/O when updating Virtual Machine configuration files. On occasion, we have observed scenarios where the .vmx file was affected by an APD condition.

In vSphere 5.1, a new timeout value for APD is being introduced. There is a new global setting for this feature called Misc.APDHandlingEnable. If this value is set to 0, the current (5.0) behavior of retrying failing I/Os forever is used. If Misc.APDHandlingEnable is set to 1 (the default), APD handling is enabled and follows the new model, using the timeout value Misc.APDTimeout.
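For those who prefer the command line, this setting can also be inspected and changed with esxcli. A quick sketch (assuming the standard /Misc advanced option namespace):

    # Check the current APD handling mode
    esxcli system settings advanced list -o /Misc/APDHandlingEnable

    # Enable the new vSphere 5.1 APD handling model
    esxcli system settings advanced set -o /Misc/APDHandlingEnable -i 1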

Misc.APDTimeout is set to 140 seconds by default, and is tunable. [The lower limit is 20 seconds, but this is only for testing.] Both settings (Misc.APDHandlingEnable & Misc.APDTimeout) are exposed in the vSphere UI. When APD is detected, the timer starts. After 140 seconds, the device is marked as APD Timeout, and any further I/Os are fast-failed with a status of NO_CONNECT. This is the same sense code observed when an FC cable is disconnected from an FC HBA. This fast-failing of I/Os prevents hostd from getting stuck waiting on I/O. If any of the paths to the device recovers, subsequent I/Os to the device are issued normally and the special APD treatment finishes.
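The timeout is tunable in the same way, and the state of an individual device can be checked once the timer expires. A sketch, where naa.xxx is a placeholder for a real device identifier:

    # Tune the APD timeout (value in seconds; 140 is the default)
    esxcli system settings advanced set -o /Misc/APDTimeout -i 140

    # Inspect a device; its Status field reflects an APD/PDL condition
    esxcli storage core device list -d naa.xxx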

Single-LUN, Single-Target
We also wanted to extend PDL (Permanent Device Loss) detection to those arrays that have only a single LUN per target. On these arrays, when the LUN disappears, so does the target, so we could never get back a SCSI sense code as mentioned earlier.

Now in 5.1, the iSCSI initiator attempts to re-login to the target after a dropped session. If the device is not accessible, the storage system rejects the effort to access the storage. Depending on the response from the array, we can then declare the device to be in PDL, not just unreachable.
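From the host side, the iSCSI sessions and the resulting path states can be examined with esxcli. A sketch (the adapter and device names below are placeholders):

    # List iSCSI sessions to see whether the re-login to the target succeeded
    esxcli iscsi session list --adapter=vmhba33

    # List the paths to the device; paths to a lost device show up as dead
    esxcli storage core path list -d naa.xxx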

I’m very pleased to see these APD enhancements in vSphere 5.1. The more that is done to mitigate the impact of APD, the better.

Get notification of these blog postings and more VMware Storage information by following me on Twitter: @CormacJHogan
