A very quick “public service announcement” post this morning folks, simply to bring your attention to a new knowledge base article that our support team have published. The issue relates to APD (All Paths Down), a condition that can occur when a storage device is removed from an ESXi host in an uncontrolled manner, and it only affects ESXi 6.0. The bottom line is that even though the paths to the device recover and the device comes back online, the APD timeout continues to count down and expire, and as a result the device is placed in the APD timeout state. This obviously has an impact on virtual machines and workloads that are using the device.
Unfortunately there is no resolution at this time, but there are some workarounds detailed in the KB article. For those of you who dealt with APD events in earlier versions of vSphere, you’ll know the drill.
The KB article is 2126021. Note that this doesn’t affect all APD behaviours. Most APD events are handled just fine. However, I’d urge you to take a quick read of the KB just to familiarize yourself with the behaviour and workarounds while we work on a permanent solution.
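To make the countdown lifecycle concrete, here is a rough Python sketch of how an APD timeout is intended to behave. The class and method names are illustrative, not ESXi internals, though the 140 second default does correspond to the Misc.APDTimeout advanced setting:

```python
APD_TIMEOUT_SECS = 140  # default of the Misc.APDTimeout advanced setting


class Device:
    """Illustrative model of a device's APD countdown (not ESXi internals)."""

    def __init__(self):
        self.apd_start = None        # timestamp when all paths went down
        self.in_apd_timeout = False  # once True, I/O to the device is fast-failed

    def all_paths_down(self, now):
        if self.apd_start is None:
            self.apd_start = now     # start the countdown

    def path_restored(self):
        # Intended behaviour: recovery cancels the countdown.
        # The ESXi 6.0 issue in KB 2126021 is, in effect, that the
        # countdown keeps running and expires despite the recovery.
        self.apd_start = None
        self.in_apd_timeout = False

    def tick(self, now):
        # Periodic check: declare APD timeout once the countdown expires
        if self.apd_start is not None and now - self.apd_start >= APD_TIMEOUT_SECS:
            self.in_apd_timeout = True
```

The point of the sketch is the `path_restored()` step: once the device is back, the countdown should be cancelled, which is exactly what fails to happen on ESXi 6.0.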
I’ve been hit up this week by a number of folks asking about “ATS Miscompare detected between test and set HB images” messages after upgrading to vSphere 5.5U2 and 6.0. The purpose of this post is to give you some background on why this might have started to happen.
In vSphere 5.5U2, we started using ATS for maintaining the heartbeat. Prior to this release, we only used ATS when the heartbeat state changed. For example, referring to the older blog, we would use ATS in the following cases:
Acquire a heartbeat
Clear a heartbeat
Replay a heartbeat
Reclaim a heartbeat
We did not use ATS for maintaining the ‘liveness’ of a heartbeat. This is the change that was introduced in 5.5U2 and which appears to have led to issues for certain storage arrays.
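For background, ATS is the VAAI compare-and-write primitive: the host hands the array a “test” image and a “set” image, and the array writes the set image only if the on-disk contents still match the test image, atomically. A miscompare means the heartbeat slot changed between the host’s read and its update. A toy Python model of the semantics (all names are illustrative, and a lock stands in for the array’s atomicity guarantee):

```python
import threading


class AtsLun:
    """Toy model of a LUN heartbeat slot supporting ATS (compare-and-write)."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the array's atomicity
        self._block = b"\x00" * 8      # the on-disk heartbeat slot

    def ats(self, test_image: bytes, set_image: bytes) -> bool:
        """Write set_image only if the slot still matches test_image."""
        with self._lock:
            if self._block != test_image:
                return False           # ATS miscompare
            self._block = set_image
            return True


def heartbeat_tick(lun: AtsLun, last_image: bytes, new_image: bytes) -> bytes:
    """One liveness update, done via ATS as of 5.5U2 rather than a plain write."""
    if lun.ats(last_image, new_image):
        return new_image
    raise RuntimeError("ATS Miscompare detected between test and set HB images")
```

Since 5.5U2 every periodic liveness tick goes through this compare-and-write path, so an array that occasionally fails the compare, which previously went unnoticed between state changes, now surfaces as the miscompare message in the logs.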
I wouldn’t normally call out new patch releases in my blog, but this one has an important fix for Virtual SAN users. As per KB article 2102046, this patch addresses a known issue with clomd. The symptoms are as follows:
Virtual machine operations on the Virtual SAN datastore might fail with an error message.
The Virtual SAN cluster might report that the Virtual SAN datastore is running out of space even though space is available in the datastore. An error message similar to the following is displayed:
There is no more space for virtual disk .vmdk. You might be able to continue this session by freeing disk space on the relevant volume, and clicking Retry. Click Cancel to terminate this session.
The clomd service might also stop responding.
While the clomd issue is easily addressed by restarting the clomd service, consider deploying this patch during your next maintenance cycle to avoid the annoyance. And if you’re considering a new VSAN deployment, definitely use this latest version of ESXi 5.5.
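For reference, restarting clomd is done with the standard ESXi init script from an SSH (or local shell) session on each affected host; verify the script path on your build first:

```shell
# Run on each ESXi host in the VSAN cluster (SSH / local shell session)
/etc/init.d/clomd status    # check whether clomd is still running/responding
/etc/init.d/clomd restart   # restart the CLOM daemon as a stopgap
```

This is only a workaround; the patch is the proper fix.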
A quick note to let you know about a recently published KB article which reports incorrect values for Outstanding IO in the VSAN Observer tool (used for monitoring the performance of VSAN deployments) when using vSphere 5.5U2.
Virtual SAN (VSAN) Observer graphs in the “VSAN Client”, “VSAN Disk”, “DOM Owner” or individual VSAN object on the “VM” tab show very high Outstanding I/O (OIO) value that is inconsistent with the actual I/O load.
Here is a sample screenshot from my VSAN environment running vSphere 5.5U2. As you can see the Outstanding IO values are off the scale:
Of course, this behaviour may lead to you “chasing your tail”, so to speak, when monitoring or troubleshooting VSAN, so we are working on getting this resolved as soon as possible. Check the KB article regularly for updates regarding a fix. In the meantime, bear in mind that a very high Outstanding IO count in VSAN Observer on 5.5U2 is a known reporting issue and may not be a symptom of any underlying problem.
I’m a bit late in bringing this to your attention, but there is a potential issue with VASA storage providers disconnecting from vCenter resulting in no VSAN capabilities being visible when you try to create a VM Storage Policy. These storage providers (there is one on each ESXi host participating in the VSAN Cluster) provide out-of-band information about the underlying storage system, in this case VSAN. If there isn’t at least one of these providers on the ESXi hosts communicating to the SMS (Storage Monitoring Service) on vCenter, then vCenter will not be able to display any of the capabilities of the VSAN datastore, which means you will be unable to build any further storage policies for virtual machine deployments (currently deployed VMs already using VM Storage Policies are unaffected). Even a resynchronization operation fails to reconnect the storage providers to vCenter. This seems to be predominantly related to vCenter servers which were upgraded to vCenter 5.5U1 and not newly installed vCenter servers.
We’ve seen a spate of incidents recently related to the HP Smart Array Drivers that are shipped as part of ESXi 5.x. Worst case scenario – this is leading to out of memory conditions and a PSOD (Purple Screen of Death) on the ESXi host in some cases. The bug is in the hpsa 126.96.36.199-1 driver and all Smart Array controllers that use this driver are exposed to this issue. For details on the symptom, check out VMware KB article 2075978.
This was a tricky one to deal with, as one possible step might be to roll back/downgrade the driver to an earlier version. Unfortunately, not only is this not supported (or documented), but you might also find that an older driver may not work with a newer storage controller. The good news is that HP now have a new version of the driver available which fixes the issue. Customers should upgrade to HP Smart Array Controller Driver (hpsa) Version 188.8.131.52-1 (ESXi 5.0 and ESXi 5.1) or Version 184.108.40.206-1 (ESXi 5.5). Details on where to locate the driver and how to upgrade it are located in their advisory. Think about doing this as soon as possible.
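To confirm which hpsa build a host is actually running before and after the upgrade, the standard ESXi driver and VIB commands can be used. The offline bundle filename below is purely illustrative; take the real one from HP’s advisory:

```shell
# Report the version of the loaded hpsa module on the ESXi host
vmkload_mod -s hpsa | grep -i version

# List the installed hpsa VIB and its version
esxcli software vib list | grep -i hpsa

# Install HP's updated driver bundle (path/filename illustrative),
# then reboot the host so the new driver is loaded
esxcli software vib install -d /vmfs/volumes/datastore1/hpsa-offline-bundle.zip
```

Checking the version first avoids patching hosts that are already on the fixed driver.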
Many readers will be aware of an ongoing issue with NFS in ESXi 5.5U1. My colleague, Duncan, wrote an article about it on his blog site recently entitled – Alert: vSphere 5.5 & NFS issue. Essentially, your NFS datastore may experience an APD (All Paths Down) condition. The issue is also described in KB article 2076392.
I’m pleased to say that VMware has now produced a patch to address this issue. The patch is 5.5EP4 (June 2014) and can be downloaded from VMware’s patch repository site; search on ESXi (Embedded and Installable), version 5.5.0. Another KB article, 2077360, has more information about the patch fix.