Heads Up! VAAI UNMAP issues on EMC VMAX

Cormac

10 years ago

We just got notification about a potential issue with the VAAI UNMAP primitive when used on EMC VMAX storage systems with Enginuity version 5876.159.102 or later. It seems that during an ESXi reboot, or during a device ATTACH operation, the ESXi host may report corruption. The following is an overview of the details found in EMC KB 184320. Besides the corruption report, vCenter operations on virtual machines may fail to complete, and the following errors might be found in the VMkernel logs:

WARNING: Res3: 6131: Invalid clusterNum: expected 2624, read 0 
[type 1] Invalid totalResources 0 (cluster 0).[type 1] Invalid  
nextFreeIdx 0 (cluster 0).

WARNING: Res3: 3155: Volume aaaaaaaa-bbbbbbbb-cccc-dddddddddddd 
("datastore1") might be damaged on the disk. Resource cluster  
metadata corruption has been detected
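
If you want to check a host for these messages, a small script along the following lines could help. This is only a rough sketch: the function names are made up for illustration, the default path /var/log/vmkernel.log is an assumption about where the log lives on your host, and the search strings are simply short fragments of the two warnings quoted above (the real log lines may not be wrapped the way they are shown here).

```python
import re

# Short fragments of the two warnings quoted above. Matching on fragments
# avoids depending on how the log lines happen to be wrapped.
SUSPECT = re.compile(
    r"Invalid clusterNum"
    r"|might be damaged on the disk"
    r"|metadata corruption has been detected"
)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines that match a fragment."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SUSPECT.search(line):
            hits.append((number, line.rstrip()))
    return hits


def scan_file(path="/var/log/vmkernel.log"):
    """Scan a log file; the default path is an assumption, adjust as needed."""
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle)
```

A hit does not prove that this particular Enginuity issue is the cause; it only tells you the warnings are present and that it is worth talking to EMC support.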

The VAAI UNMAP primitive is used for reclaiming ‘dead’ or ‘stranded’ space on thin provisioned VMFS datastores. It seems the issue is related to a combination of the VAAI Thin Provisioning Stun primitive and the VAAI UNMAP primitive.

To allow a datastore to return useful SCSI sense data when a device on the storage system cannot allocate any more space, a “NULL UNMAP” command is sent to the storage system from the ESXi host. This allows ESXi to trigger a Thin Provisioning Stun on a VM when it requests additional space on an already full datastore. This command, unlike a normal UNMAP, does not UNMAP any blocks.

A normal UNMAP command, however, when sent from the ESXi host to the storage system, populates a buffer in the storage system with UNMAP descriptor information. What EMC has found is that the “NULL UNMAP” used for TP Stun does not send a descriptor with the command, so the buffer remains uninitialized. This is the root of the issue.

When the VMAX storage system processes the UNMAP commands, the “NULL UNMAP” commands are also processed but since the descriptor information hasn’t initialized the buffer, the command uses residual information found in the internal buffer. This causes blocks which could still be in use to become unmapped.
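
To make the failure mode a bit more concrete, here is a toy model of the behaviour described above. It is purely illustrative and is not how Enginuity is actually implemented: the class, its methods and the block numbers are all invented for this sketch. The only thing it tries to capture is that a command arriving without descriptors leaves the previous descriptors in the buffer, and those then get processed a second time.

```python
class ToyArray:
    """Toy model of a thin device. Hypothetical, not EMC's firmware logic."""

    def __init__(self, num_blocks):
        self.mapped = set(range(num_blocks))  # every block starts out allocated
        self.descriptor_buffer = []           # persists between commands

    def write(self, start, count):
        """Writing to a range (re)allocates those blocks."""
        self.mapped |= set(range(start, start + count))

    def unmap(self, descriptors=None):
        """Process an UNMAP; descriptors=None stands in for a NULL UNMAP."""
        # A normal UNMAP carries descriptors, which overwrite the buffer.
        # A NULL UNMAP carries none, so the buffer keeps its old contents.
        if descriptors is not None:
            self.descriptor_buffer = list(descriptors)
        # The problematic step: whatever is in the buffer gets processed,
        # even if it was left there by an earlier command.
        for start, count in self.descriptor_buffer:
            self.mapped -= set(range(start, start + count))


def demo():
    """Return the blocks that are in use but end up unmapped."""
    array = ToyArray(16)
    array.unmap([(4, 4)])  # legitimate reclaim of blocks 4-7
    array.write(4, 4)      # blocks 4-7 are reused for new data
    array.unmap()          # NULL UNMAP, sent only for TP Stun purposes
    return set(range(4, 8)) - array.mapped
```

In this model, demo() reports that blocks 4 to 7 have been unmapped even though they were rewritten after the original reclaim, which is the kind of outcome that would show up on the host as metadata corruption.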

EMC already has a fix (Fix 72255) for the problem, which can be requested through the EMC Support Center. EMC support will verify whether your VMAX is exposed to this Enginuity issue. I’ve been informed that the fix can be requested through the Enginuity Pack (epack) process.

Of course, if you are affected, it is advisable not to use UNMAP (via vmkfstools -y or the new esxcli namespace) until the fix is in place.