Heads Up! ATS Miscompare detected between test and set HB images

I’ve been hit up this week by a number of folks asking about “ATS Miscompare detected between test and set HB images” messages after upgrading to vSphere 5.5U2 and 6.0. The purpose of this post is to give you some background on why this might have started to happen.

First off, ATS is the Atomic Test and Set primitive, which is one of the VAAI primitives. You can read all about VAAI primitives in the white paper. HB is short for heartbeat. This is how ownership of a file (e.g. a VMDK) is maintained on VMFS, i.e. a lock. You can read more about heartbeats and locking in this blog post of mine from a few years back. In a nutshell, the heartbeat region of VMFS is used for on-disk locking, and every host that uses the VMFS volume has its own heartbeat region. This region is updated by the host on every heartbeat. The part that is updated is the time stamp, which tells other hosts that this host is alive. When the host is down, this region is used to communicate lock state to other hosts.
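To make the test-and-set idea concrete, here is a minimal Python sketch of a compare-then-write update on a heartbeat slot. It is a toy model only: `ToyVolume`, `HeartbeatImage` and `AtsMiscompare` are names invented for illustration, and nothing here reflects the actual VMFS on-disk format or the ESXi implementation. The point is simply that the write succeeds only when the slot still holds the image the host expects, and a mismatch is reported instead of being overwritten.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class AtsMiscompare(Exception):
    """The on-disk image did not match the expected ("test") image."""


@dataclass(frozen=True)
class HeartbeatImage:
    # Hypothetical, simplified heartbeat record: owning host plus a time stamp.
    owner: str
    timestamp: int


class ToyVolume:
    """Toy stand-in for a shared volume with one heartbeat slot per host."""

    def __init__(self) -> None:
        self.slots: Dict[str, Optional[HeartbeatImage]] = {}

    def atomic_test_and_set(
        self,
        slot: str,
        test_image: Optional[HeartbeatImage],
        set_image: HeartbeatImage,
    ) -> None:
        # Write set_image only if the slot still holds test_image.
        # A real array performs the compare and the write as one atomic step.
        current = self.slots.get(slot)
        if current != test_image:
            raise AtsMiscompare(f"expected {test_image!r}, found {current!r}")
        self.slots[slot] = set_image


volume = ToyVolume()
first = HeartbeatImage("esx01", 100)
second = HeartbeatImage("esx01", 103)

volume.atomic_test_and_set("esx01", None, first)    # acquire an empty slot
volume.atomic_test_and_set("esx01", first, second)  # refresh the time stamp

try:
    # Stale expectation: the slot now holds `second`, not `first`.
    volume.atomic_test_and_set("esx01", first, HeartbeatImage("esx01", 106))
except AtsMiscompare:
    miscompare_seen = True
else:
    miscompare_seen = False
```

In this toy model, a miscompare just means the host's view of its slot and the stored contents disagree at the moment of the write.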

In vSphere 5.5U2, we started using ATS for maintaining the heartbeat. Prior to this release, we only used ATS when the heartbeat state changed. For example, referring to the older blog, we would use ATS in the following cases:

  • Acquire a heartbeat
  • Clear a heartbeat
  • Replay a heartbeat
  • Reclaim a heartbeat

We did not use ATS for maintaining the ‘liveness’ of a heartbeat. This is the change that was introduced in 5.5U2 and which appears to have led to issues for certain storage arrays.
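The before and after behaviour described above can be summarised in a few lines of Python. This is only an illustration of the policy as described in this post, assuming an ATS-capable datastore; the operation names and the `ats_for_liveness` flag are made up for the sketch and are not ESXi identifiers.

```python
# Heartbeat state changes that used ATS even before 5.5U2 (from the list above).
STATE_CHANGE_OPERATIONS = frozenset({"acquire", "clear", "replay", "reclaim"})

# Hypothetical label for the periodic time stamp refresh ("liveness").
LIVENESS_UPDATE = "liveness_update"


def uses_ats(operation: str, ats_for_liveness: bool) -> bool:
    """Return True if the given heartbeat operation would be issued via ATS.

    ats_for_liveness=False models the behaviour prior to 5.5U2;
    ats_for_liveness=True models 5.5U2 and later.
    """
    if operation in STATE_CHANGE_OPERATIONS:
        return True
    if operation == LIVENESS_UPDATE:
        return ats_for_liveness
    raise ValueError(f"unknown heartbeat operation: {operation}")


all_operations = sorted(STATE_CHANGE_OPERATIONS) + [LIVENESS_UPDATE]
before_55u2 = {op: uses_ats(op, ats_for_liveness=False) for op in all_operations}
from_55u2 = {op: uses_ats(op, ats_for_liveness=True) for op in all_operations}
```

The only entry that differs between the two dictionaries is the liveness update, which is exactly the change that 5.5U2 introduced.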

This week, IBM posted this Flash (Alert) ssg1S1005201 on the “ATS Miscompare” messages, stating the following:

Due to the low timeout value for heartbeat I/O using ATS, this can lead to host disconnects and application outages if delays of 8 seconds or longer are experienced in completing individual heartbeat I/Os on backend storage systems or the SAN infrastructure.
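As a rough illustration of the threshold mentioned in the alert, the following Python sketch flags heartbeat I/O completions that took 8 seconds or longer. The latency samples are invented, and the function is a hypothetical helper rather than anything provided by ESXi or the array; how you collect real latencies is outside the scope of this sketch.

```python
from typing import List, Sequence

# Delay figure quoted in the IBM alert above.
HEARTBEAT_DELAY_THRESHOLD_SECONDS = 8.0


def slow_heartbeat_ios(
    latencies_seconds: Sequence[float],
    threshold: float = HEARTBEAT_DELAY_THRESHOLD_SECONDS,
) -> List[int]:
    """Return the indexes of samples that took `threshold` seconds or longer."""
    return [
        index
        for index, latency in enumerate(latencies_seconds)
        if latency >= threshold
    ]


# Made-up completion times for five consecutive heartbeat I/Os, in seconds.
sample_latencies = [0.02, 0.5, 8.0, 3.1, 12.4]
slow = slow_heartbeat_ios(sample_latencies)
```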

To clarify, VMware is not doing anything incorrect by using ATS for heartbeat liveness. What was introduced is a more aggressive use of ATS. In the IBM case, there is a recommendation to disable ATS for heartbeats and revert to the older mechanism of maintaining heartbeat liveness. This change does not require a reboot, FYI. What this basically does is set the behaviour back to the way it was before 5.5U2.

This change to more widespread use of ATS may not just manifest itself on IBM storage arrays. But, having said that, the vast majority of customers running on 5.5U2 or later do not appear to be seeing any issues. Their arrays appear to be handling this change in ATS usage absolutely fine.

For anyone else who observes these messages on other storage arrays after upgrading to 5.5U2 (or vSphere 6.0), please don’t simply disable ATS heartbeats. I’d recommend that you have a conversation with your array vendor for guidance. They may already have a fix for the issue.

Our GSS folks are working on a knowledge base article which we hope to publish very soon.

[Update – April 21st, 2015] KB article 2113956 is now published.

A link to an article discussing ATS Miscompare on EMC VMAX.

A link to an article discussing ATS Miscompare on SolidFire.

13 comments
  1. This is certainly a KB doc that’s near and dear to my heart, as I’ve been on numerous calls with IBM and VMware (and other vendors) for this very issue.

    When reverting to the older heartbeat mechanism, we were told (by VMware) that a reboot is not required, but the setting will not take effect until the storage is idle, that is, until there are no disk reads or writes happening. As a result, you can change the setting but it’s not easy to track when it actually goes into effect. Some of our clusters host very busy database VMs that write more or less constantly to the storage. We decided to reboot every host connected to the IBM SVC to ensure the setting was in effect.

    We went from a high of about a dozen PSODs a week before changing the setting to none since the setting has been changed.

      • I would like to clarify this:
        Disabling the use of ATS to create or update the VMFS HB does not require a reboot. However, this is different from disabling or enabling the ATS VAAI primitive itself, which would require a host reboot.
        Since the latter is not changed by changing the former, host reboot is not required.

  2. The KB mentions that this issue is on IBM Storwize. Are there any other array vendors, such as NetApp or Hitachi, who seem to have faced similar issues?

    • None to my knowledge at this time. If I hear of any others, I’ll update the post (and I suspect the KB will also be updated to reflect any other occurrences)

      • According to VMware (I recently opened a case) no other vendor or product is affected…just IBM Storwize and IBM SVC.

  3. Hi,
    we have the same problem with our Hitachi VSP, and we also use a NetApp 8040,
    where we have no problems. The ‘ATS Miscompare detected’ problem leads to frame drops, state in doubt and sometimes LUN lock problems. If you rename your LUNs in VMware, it leads to massive locking and your hosts may stop responding. We have been investigating this problem since last year, when our VMware was on 5.0, so it looks like an older VMware problem. Hitachi’s analysis of the communication between VMware and the Hitachi VSP shows us that there really is a bug in the implementation of ATS/VAAI which ends in ‘ATS Miscompare detected’. The above KB article looks like a workaround which only solves part of the problem. We also got this from the VM-Support

    • Hi Michael,
      The change I mentioned was only introduced in 5.5U2. It allowed ATS to be used for maintaining heartbeats.
      If you are seeing the issues on earlier versions of ESX, it suggests this is not the root cause of the issue.

  4. Hello Cormac,

    Thank you so much for this article.

    It seems we experienced the same problem with our hosts and 3PAR array (7400 running latest version 3.2.1 MU2) last Saturday. All the ESXi clusters were showing a lot of dead paths and were unable to do any I/O operations on datastores.

    However, we have only one cluster which is on version 5.5 U2; the other 7 clusters were on version 5.0 U2, but we experienced problems on all of those clusters too.

    We do have some shared LUNs across all clusters (but I believe that number would be 3-5 shared LUNs).

    The way we could solve the problem was by rebooting the 3PAR nodes and all hosts in all clusters to bring things back to a normal state.

    What are your thoughts on this? Why would we have problems on all the hosts which were not on version 5.5 U2?

    We are still trying to find out the cause.

  5. We are seeing the following errors on version 5.5 no update hosts… but no ATS Miscompare messages in the kernel.log

    lost access to volume due to connectivity issues. Recovery attempt is in progress and the outcome will be reported shortly
