vSphere 5.1 Storage Enhancements – Part 4: All Paths Down (APD)

All Paths Down (APD) is a situation that occurs when a storage device is removed from the ESXi host in an uncontrolled manner, either due to administrative error or device failure. Over the past few vSphere releases, VMware has made significant improvements to handling the APD (All Paths Down) condition. This is a difficult condition to manage since we don’t know if the device is gone forever or if it might come back, i.e. is it a permanent device loss or is it a transient condition?

The bigger issue around APD is what it can do to hostd. Hostd worker threads will sit waiting indefinitely for I/O to return (for instance, when a rescan of the SAN is initiated from the vSphere UI). However, hostd only has a finite number of worker threads, so if these all get tied up waiting for disk I/O, then other hostd tasks will be affected. A common symptom of APD is ESXi hosts disconnecting from vCenter because their hostd daemons have become wedged.

I wrote about the vSphere 5.0 enhancements that were made to APD handling on the vSphere Storage blog here, if you want to check back on them. Basically, a new condition known as PDL (Permanent Device Loss) was introduced in vSphere 5.0 for the case where we know the device is never coming back. We learn this through SCSI sense codes sent by the target array. Once we have a PDL condition, we can fast-fail I/Os to the missing device and prevent hostd from getting tied up.
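
As an aside, the device state that corresponds to a PDL is visible through the vSphere API as well as the UI, so it can be checked from a script. Below is a minimal sketch using pyVmomi (the Python bindings for the vSphere API); the hostname and credentials are placeholders of my own, and I'm assuming a direct connection to a standalone ESXi host. A device in PDL reports an operational state of 'lostCommunication' (shown as "lost communication" in the UI).

```python
# Minimal sketch: list SCSI devices whose operational state is not 'ok'.
# Assumptions: pyVmomi is installed; host/user/pwd are placeholders for a
# direct connection to a standalone ESXi host.
import ssl
from pyVim.connect import SmartConnect, Disconnect

ctx = ssl._create_unverified_context()   # lab use only; validate certificates in production
si = SmartConnect(host="esxi01.example.com", user="root", pwd="password", sslContext=ctx)

try:
    # First (and only) host when connected directly to ESXi
    host = si.content.rootFolder.childEntity[0].hostFolder.childEntity[0].host[0]

    for lun in host.config.storageDevice.scsiLun:
        # A healthy device reports 'ok'; a PDL device reports 'lostCommunication'
        if "ok" not in lun.operationalState:
            print(lun.canonicalName, lun.operationalState)
finally:
    Disconnect(si)
```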

Following on from the 5.0 APD handling improvements, what we want to achieve in vSphere 5.1 is as follows:

  • Handle more complex transient APD conditions, and not have hostd getting stuck indefinitely when devices are removed in an uncontrolled manner.
  • Introduce some sort of PDL method for those iSCSI arrays which present only one LUN per target. These arrays were problematic for APD handling, since once the LUN went away, so did the target, and we had no way of getting back any SCSI sense codes.

It should be noted that in vSphere 5.0 U1, we fixed an issue so that vSphere correctly detects PDL and restarts the affected VMs on other hosts in a vSphere HA cluster which may not be in this state. This enhancement is also in 5.1.

Complex APD
As I have already mentioned, All Paths Down affects more than just virtual machine I/O. It can also tie up hostd worker threads, leading to host disconnects from vCenter in worst-case scenarios, and it can affect vmx I/O when virtual machine configuration files are updated. On occasion, we have observed scenarios where the .vmx file was affected by an APD condition.

In vSphere 5.1, a new timeout value for APD is being introduced, controlled by a new global setting called Misc.APDHandlingEnable. If this value is set to 0, the current (5.0) behavior of retrying failing I/Os forever is used. If Misc.APDHandlingEnable is set to 1 (the default), APD handling follows the new model, using the timeout value Misc.APDTimeout.

Misc.APDTimeout is set to 140 seconds by default and is tuneable. [The lower limit is 20 seconds, but that is only for testing.] These settings (Misc.APDHandlingEnable & Misc.APDTimeout) are exposed in the vSphere UI. When APD is detected, the timer starts. After 140 seconds, the device is marked as APD Timeout and any further I/Os are fast-failed with a status of NO_CONNECT, the same sense code observed when an FC cable is disconnected from an FC HBA. This fast failing of I/Os prevents hostd from getting stuck waiting on I/O. If any of the paths to the device recovers, subsequent I/Os to the device are issued normally and the special APD treatment finishes.
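
For completeness, here is a minimal sketch of how those two advanced settings could be queried and changed programmatically with pyVmomi rather than through the vSphere UI. The hostname and credentials are placeholders of my own; only the option names Misc.APDHandlingEnable and Misc.APDTimeout come from the feature itself.

```python
# Minimal sketch: query and update the APD advanced settings on an ESXi host.
# Assumptions: pyVmomi is installed; host/user/pwd are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab use only
si = SmartConnect(host="esxi01.example.com", user="root", pwd="password", sslContext=ctx)

try:
    host = si.content.rootFolder.childEntity[0].hostFolder.childEntity[0].host[0]
    adv = host.configManager.advancedOption

    # Show the current values of the two APD-related options
    for name in ("Misc.APDHandlingEnable", "Misc.APDTimeout"):
        for opt in adv.QueryOptions(name):
            print(opt.key, "=", opt.value)

    # Enable the new APD handling model and keep the default 140 second timeout
    # (both options take integer values)
    adv.UpdateOptions(changedValue=[
        vim.option.OptionValue(key="Misc.APDHandlingEnable", value=1),
        vim.option.OptionValue(key="Misc.APDTimeout", value=140),
    ])
finally:
    Disconnect(si)
```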

Single-LUN, Single-Target
We also wanted to extend PDL (Permanent Device Loss) detection to those arrays that only have a single LUN per target. On these arrays, when the LUN disappears, so does the target, so we could never get back a SCSI sense code, as mentioned earlier.

Now in 5.1, the iSCSI initiator attempts to re-login to the target after a dropped session. If the device is not accessible, the storage system rejects the effort to access it. Depending on the response from the array, we can determine that the device is in PDL, not just unreachable.

I’m very pleased to see these APD enhancements in vSphere 5.1. The more that is done to mitigate the impact of APD, the better.

Get notification of these blog postings and more VMware Storage information by following me on Twitter: @CormacJHogan

19 Replies to “vSphere 5.1 Storage Enhancements – Part 4: All Paths Down (APD)”

  1. Very clear explanations, as usual I would say. Thanks for this precious information; I remember putting this on my wishlist during some random discussions at VMworld EMEA in 2009, and now it’s real!

  2. These timeout values are gonna save a lot of people some serious issues. It’s only a few clicks to accidentally unpresent a LUN in some systems (like CLARiiON) or get into APD in other ways. It’s happened to me twice.

  3. Cormac, if you end up in an APD situation with the default settings, can you set Misc.APDHandlingEnable to 1 during the APD and get the timeout timer running?

    1. Hey Tomi, so Misc.APDHandlingEnable is on by default in 5.1. However, if for some reason you had set it to off, and suffered an APD, then yes – turning it on will start the timer for any APD devices.

      1. I have dealt with two very large customers who are using 5.1 with the new option enabled, with really bad results. The post above describes better handling of effectively synchronous I/O events that are never going to get a response and would therefore wait forever, which leads to a lack of sufficient worker threads. The idea is to identify that case and block the start of a new thread on a known dead path. In large environments (5000+ machines), the odds of a 100+ LUN configuration seeing a SAN delay reaching beyond 140 seconds are real. Our experience is that the 100+ VMs per host saturate the available worker threads within 10 seconds, and then all LUNs are removed from active I/O. When this occurs, hostd turns into a rock anyway.

        I am trying to determine which is better in these environments: the 5.0 handling or the 5.1 handling. I currently do not know.

  4. Cormac, can you help me understand a situation that recently happened to me, which I think this new 5.1 setting may mitigate? Imagine dual SAN fabrics, where one fabric logically goes down but sends no sense codes to the ESXi host to indicate that it is not processing I/Os. The switch is still up and link status is fine, but I/Os are dropped on the floor.

    ESXi freezes up, even though two good paths on the other fabric are available (active/active round robin). Only when the affected fabric HBAs are put in a down state does ESXi unfreeze and resume sending I/Os down the two (always available) paths. 1) Not sure why ESXi couldn’t use the active paths in this case, and 2) it sounds like the new 5.1 140s timeout could quickly and automatically fix the freezing problem?

    1. Derek,
      This is not an APD (All Paths Down) condition, so the new timeout setting will not help. You still have 2 active paths. We won’t be able to troubleshoot the issue in the comments of the blog post, but you should engage our support guys to figure out why the remaining 2 paths were dropping I/Os.

      1. Thanks. I did open a case, and the explanation was that unless the SAN switch sends a SCSI sense code about the condition, ESXi will retry the I/Os forever on the black-hole path and not use the active paths. Didn’t seem right to me, but that was their verdict.

  5. We had an issue yesterday where we had to reset one of our SANs due to a configuration change on the SAN. We have never had problems doing this in the past (on the very few occasions it was needed, of course): vSphere 5.0 Build 6xx,xxx hosts showed no problems in losing the path to the Datastores and re-connecting when the SAN came back up (about 2 minutes). This time, vSphere 5.0 Build 9xx,xxx hosts showed a PDL (inactive Datastores and “lost communication” at the LUN level on the Storage Adapters page).

    This is the first time we have ever had to deal with a PDL, and it appears there is no graceful recovery from it? I have a ticket open with support to hopefully get some good news that there is a “re-connect” that can be done in this situation.

    The only option was to vMotion the VMs whose shared storage was still accessible, and reboot the host (not in maintenance mode). Once the host came back up, the Datastores were active, the VMs that had basically hung on the inactive Datastores powered up automatically, and all was well again. This doesn’t seem like a great solution, of course.

    Support has indicated this: as a best practice we need to shut down the VMs, shut down the host, and then reboot the SAN. Shutting down all hosts is the best practice defined by VMware.

    So, if a SAN needs to be reset, or all paths “to a specific SAN” (we have many) need to be taken down temporarily, the entire cluster needs to be shut down? Seems absurd!?

    Just thought I’d check to see if there were any ideas others had – and if there are enough improvements in 5.1 to avoid this issue entirely? A “re-connect” option at the LUN level in the Storage Adapters section where “lost communication” is listed would be the easiest solution.

    Eric

  6. Hi Cormac,
    We opened a case about an APD problem on ESXi 5.1:

    Why: if a backup storage array goes down, the ESXi host shows an APD and stops all VMs that had a connection to it – a little crazy for a backup LUN!
    The backup LUN has no operational responsibility – it is only for backup.

    -> This means that if an unimportant LUN stops, we lose 100 or 1,000 important VMs.

    Second bad behaviour:
    Oracle storage replication – the replicated Oracle DB is on a raw device (as VMware recommends). If the connection to the second vmdk (for the raw device) is lost -> the Oracle server stops, and sometimes the file system ends up crashing (corrupt).

    Shane is working on this case.

    Can you help us – maybe new features in 5.5?

    thx

    Andreas

    1. I’d recommend driving this via the support channels. That’s going to be the quickest route to an explanation.

      All I’d add is that, in the first item, if the device entered APD and not PDL, then it must have been an ‘unusual’ failure to have impacted the ESXi host. Typically such a failure will result in a SCSI sense code being sent back, and the ESXi host can handle it appropriately (via PDL).

      I’ve no experience of Oracle’s storage replication product, so can’t really comment on that one.
