A very quick “public service announcement” post this morning folks, simply to bring your attention to a new knowledge base article that our support team have published. The issue relates to APD (All Paths Down), a condition that can occur when a storage device is removed from an ESXi host in an uncontrolled manner. The issue only affects ESXi 6.0. The bottom line is that even though the paths to the device recover and the device is online, the APD timeout continues to count down and eventually expires, and as a result, the device is placed in the APD timeout state. This obviously has an impact on virtual machines, workloads, etc. that are using this device.
Unfortunately there is no resolution at this time, but there are some workarounds detailed in the KB article. For those of you who dealt with APD events in earlier versions of vSphere, you’ll know the drill.
The KB article is 2126021. Note that this doesn’t affect all APD behaviors. Most APD events are handled just fine. However, I’d urge you to take a quick read of the KB just to familiarize yourself with the behavior and workarounds while we work on a permanent solution.
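To make the symptom described in this post a little more concrete, here is a minimal toy model in Python. It is only a sketch and not VMware code: the `Device` class, its methods, and the `cancel_timer` flag are all hypothetical names invented for illustration, and the flag merely mimics the observed symptom rather than the actual ESXi internals. The one real value is 140 seconds, which is the default of the `Misc.APDTimeout` advanced setting.

```python
from dataclasses import dataclass
from typing import Optional

# Default of the ESXi advanced setting Misc.APDTimeout, in seconds.
APD_TIMEOUT_SECONDS = 140


@dataclass
class Device:
    """Toy model of one storage device's APD state (hypothetical, not VMware code)."""
    paths_up: bool = True
    apd_started_at: Optional[float] = None  # when the APD timer began, if running
    apd_timed_out: bool = False

    def lose_all_paths(self, now: float) -> None:
        # All paths to the device are gone: start the APD timer if not running.
        self.paths_up = False
        if self.apd_started_at is None:
            self.apd_started_at = now

    def recover_paths(self, now: float, cancel_timer: bool = True) -> None:
        # Paths are back. In the expected case the APD timer is cancelled;
        # cancel_timer=False mimics the symptom where it keeps counting down.
        self.paths_up = True
        if cancel_timer:
            self.apd_started_at = None

    def tick(self, now: float) -> None:
        # Mark the device as APD timed out once the timer has run its course.
        if (self.apd_started_at is not None
                and now - self.apd_started_at >= APD_TIMEOUT_SECONDS):
            self.apd_timed_out = True


def simulate(cancel_timer: bool) -> Device:
    """Lose all paths at t=0, recover at t=30, then check state at t=200."""
    dev = Device()
    dev.lose_all_paths(now=0)
    dev.recover_paths(now=30, cancel_timer=cancel_timer)
    dev.tick(now=200)
    return dev


if __name__ == "__main__":
    print("expected:", simulate(cancel_timer=True))
    print("symptom: ", simulate(cancel_timer=False))
```

In the second run the device ends up with its paths up but flagged as APD timed out, which is the combination the KB article describes.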
I had a query recently about changes in vSphere 6.0, especially when it comes to vSphere HA and VM Component Protection (VMCP) with vMSC, vSphere Metro Storage Cluster. The question is very straightforward: do all the same advanced setting recommendations for PDL and APD apply to vMSC on vSphere 6.0 as they did for vSphere 5.5? Or do we have some new recommendations now around PDL and APD for vMSC with the introduction of VMCP in vSphere 6.0?
All Paths Down (APD) is a situation which occurs when a storage device is removed from the ESXi host in an uncontrolled manner, either due to administrative error or device failure. Over the last several vSphere releases, VMware has made significant improvements to the handling of the APD condition. This is a difficult condition to manage since we don’t know if the device is gone forever or if it might come back, i.e. whether it is a permanent device loss or a transient condition.