We just got a notification about a potential issue with the VAAI UNMAP primitive when used on EMC VMAX storage systems with Enginuity version 5876.159.102 or later. It seems that during an ESXi reboot, or during a device ATTACH operation, the ESXi host may report corruption. The following is an overview of the details found in EMC KB 184320. Other symptoms include vCenter operations on virtual machines failing to complete, and the following errors may appear in the VMkernel logs:
WARNING: Res3: 6131: Invalid clusterNum: expected 2624, read 0
[type 1] Invalid totalResources 0 (cluster 0).[type 1] Invalid
nextFreeIdx 0 (cluster 0).
WARNING: Res3: 3155: Volume aaaaaaaa-bbbbbbbb-cccc-dddddddddddd
("datastore1") might be damaged on the disk. Resource cluster
metadata corruption has been detected
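If you want to check a host's VMkernel log for these messages, something like the following minimal Python sketch would do it. It is only an illustration: the function name `find_corruption_warnings` is my own, the patterns are taken from the messages quoted above, the unrelated sample line is made up, and it assumes each warning appears on a single line in the actual log rather than wrapped as shown here.

```python
import re

# Patterns lifted from the VMkernel messages quoted above.
CORRUPTION_PATTERNS = [
    re.compile(r"Invalid clusterNum: expected \d+, read \d+"),
    re.compile(r"Invalid totalResources \d+"),
    re.compile(r"Invalid nextFreeIdx \d+"),
    re.compile(r"might be damaged on the disk"),
    re.compile(r"metadata corruption has been detected"),
]


def find_corruption_warnings(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in CORRUPTION_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


# Two lines from the excerpt above plus one made-up unrelated line.
sample = [
    "WARNING: Res3: 6131: Invalid clusterNum: expected 2624, read 0",
    "an unrelated log line (hypothetical)",
    'WARNING: Res3: 3155: Volume aaaaaaaa-bbbbbbbb-cccc-dddddddddddd '
    '("datastore1") might be damaged on the disk.',
]

for number, text in find_corruption_warnings(sample):
    print(number, text)
```

To run it against a real log, pass an open file object instead of the `sample` list, since the function accepts any iterable of lines.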
My good pal Duco Jaspars pinged me earlier this week about an issue that was getting a lot of discussion in the VMware community. Duco also pointed me to a blog post by Andreas Peetz that describes the issue in detail.
The symptom is that the ESXi hostd process becomes unresponsive when software iSCSI is enabled. There is another symptom where an ESXi boot hangs after the message “iscsi_vmk loaded successfully” or “vmkibft loaded successfully”. This has only been observed with the ESXi 5.5 U1 Driver Rollup ISO. It has not been reported by customers using the standard ESXi 5.5 U1 media. The VMware ESXi 5.5 Update 1 Driver Rollup provides an installable ESXi ISO image that includes drivers for various products produced by VMware partners.
Initially it was reported in the community that it appeared to be an issue with the Diablo TeraDimm driver that was shipped as part of the roll-up. However, further investigation has concluded that the Emulex be2iscsi driver is at fault and is the root cause. VMware support are recommending that you use an updated be2iscsi driver as per KB article 2075171 to address the issue.
So for those of you that plan a 5.5U1 deployment and also use software iSCSI, heads up if you plan on using the ESXi 5.5U1 Driver Rollup ISO (which is only supported for use with new installs by the way, and not upgrades).
Hmm, it seems to be the week that’s in it for storage issues. After publishing the DELL EQL & VMFS issue earlier this week, I have now been given a heads-up on an EMC VNXe & iSCSI issue. The symptoms are ESXi hosts being unable to boot from an iSCSI LUN on the VNXe or ESXi hosts losing connectivity to iSCSI datastores.
Our GSS folks just released KB article 2049103 which details a VMFS Heartbeat and Lock corruption issue that manifests itself on DELL EqualLogic storage arrays when running PS Series firmware v6.0.6. As per the KB:
A VMFS datastore has a region designated for heartbeat types of operations to ensure that distributed access to the volume occurs safely. When files are being updated, the heartbeat region for those files is locked by the host until the update is complete.
In this scenario, the heartbeat region has become corrupt.
I just got a notification about this myself today. Apparently there are some interoperability issues with VAAI (vSphere APIs for Array Integration) & EMC RecoverPoint on EMC VNX arrays. It looks like the VNX Storage Processor (SP) may reboot with Operating Environment Release 32 P204 in a RecoverPoint environment.
EMC have just released a technical advisory today – ETA emc327099 – which describes the issue in more detail, but it is basically advising customers to disable VAAI on all ESXi hosts in the RecoverPoint environment while they figure this out. Hopefully it won’t take too long to come up with a solution to allow VAAI to run in these environments once again.
Thanks to our friends over at EMC (shout out to Itzik), we’ve recently been made aware of a limitation in our UNMAP mechanism in ESXi 5.0 & 5.1. It would appear that if you attempt to reclaim more than 2TB of dead space in a single operation, the UNMAP primitive does not handle this very well. The current thought is that this is because we have a 2TB (minus 512 bytes) file size limit on VMFS-5. When the space to reclaim is above this size, we cannot create the very large temporary balloon file (part of the UNMAP process), and it spews the following errors:
Just thought I’d bring to your attention something that has been doing the rounds here at VMware recently, and will be applicable to those of you using QLogic HBAs with ESXi 5.x. The following are the device queue depths you will find when using QLogic HBAs for SAN connectivity:
- ESXi 4.1 U2 – 32
- ESXi 5.0 GA – 64
- ESXi 5.0 U1 – 64
- ESXi 5.1 GA – 64
The higher depth of 64 has been this way since 24 Aug 2011 (the 5.0 GA release). The issue is that this has not been documented anywhere. For the majority of users, this is not an area of concern and is probably a benefit. But there are some concerns.
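If you are auditing a mixed estate, the defaults listed above can be captured in a small lookup table. This is only a sketch in Python: the names `QLOGIC_DEFAULT_QUEUE_DEPTH` and `depth_changed` are hypothetical, and the values are just the ones from the list above, nothing more.

```python
# Default QLogic device queue depths per release, as listed above.
QLOGIC_DEFAULT_QUEUE_DEPTH = {
    "ESXi 4.1 U2": 32,
    "ESXi 5.0 GA": 64,
    "ESXi 5.0 U1": 64,
    "ESXi 5.1 GA": 64,
}


def depth_changed(old_release, new_release):
    """Return True if the default queue depth differs between two releases."""
    old_depth = QLOGIC_DEFAULT_QUEUE_DEPTH[old_release]
    new_depth = QLOGIC_DEFAULT_QUEUE_DEPTH[new_release]
    return old_depth != new_depth


# Flag an upgrade path where the default silently changes.
if depth_changed("ESXi 4.1 U2", "ESXi 5.0 GA"):
    print("Default QLogic queue depth changes on this upgrade path")
```

Releases not in the list raise a `KeyError`, which seemed safer than guessing a default.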