First off, ATS is the Atomic Test and Set primitive, which is one of the VAAI primitives. You can read all about VAAI primitives in the white paper. HB is short for heartbeat. The heartbeat is how ownership of a file (e.g. a VMDK) is maintained on VMFS, i.e. it acts as a lock. You can read more about heartbeats and locking in this blog post of mine from a few years back. In a nutshell, the heartbeat region of VMFS is used for on-disk locking, and every host that uses the VMFS volume has its own heartbeat region. On every heartbeat, the host updates the time stamp in its region, which tells other hosts that it is alive. When the host is down, this region is used to communicate lock state to other hosts.
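To make the liveness idea concrete, here is a minimal toy model in Python. It is only a sketch: `HeartbeatSlot`, `pulse` and `owner_seems_alive` are hypothetical names, the time stamps are plain integers, and none of this reflects the real VMFS on-disk layout. It assumes that an observing host samples the time stamp twice and treats an unchanged value as a sign that the owner may be down.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatSlot:
    """Toy stand-in for one host's heartbeat region (hypothetical layout)."""
    owner: str
    timestamp: int = 0

    def pulse(self, now: int) -> None:
        # The owning host rewrites its time stamp on every heartbeat.
        self.timestamp = now


def owner_seems_alive(first_sample: int, second_sample: int) -> bool:
    # Assumption: an observer reads the time stamp twice, some interval apart.
    # If the value changed, the owner pulsed in between and is presumed alive.
    return second_sample != first_sample


slot = HeartbeatSlot(owner="esxi-01")
before = slot.timestamp
slot.pulse(now=3)
print(owner_seems_alive(before, slot.timestamp))  # the owner pulsed in between

stale = slot.timestamp
print(owner_seems_alive(stale, slot.timestamp))   # no pulse since the last sample
```

The point of the sketch is simply that liveness is inferred from the time stamp changing, which is why the write that updates it matters so much.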
In vSphere 5.5U2, we started using ATS for maintaining the heartbeat. Prior to this release, we only used ATS when the heartbeat state changed. For example, referring to the older blog, we would use ATS in the following cases:
- Acquire a heartbeat
- Clear a heartbeat
- Replay a heartbeat
- Reclaim a heartbeat
We did not use ATS for maintaining the ‘liveness’ of a heartbeat. This is the change that was introduced in 5.5U2 and which appears to have led to issues for certain storage arrays.
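As a rough illustration of what using ATS for heartbeat liveness means, the sketch below models ATS as a compare-and-write on a single block: the write only succeeds if the block still holds the image the host expects. Everything here is an assumption made for illustration (`ToyLun`, `AtsMiscompare`, `refresh_heartbeat`, the 8-byte time stamp image); it is not the VMFS or array implementation, and on a real array the compare and the write are a single atomic operation.

```python
from dataclasses import dataclass


class AtsMiscompare(Exception):
    """Raised when the block does not hold the image the caller expected."""


@dataclass
class ToyLun:
    """In-memory stand-in for a LUN: block address -> block contents."""
    blocks: dict

    def atomic_test_and_set(self, lba: int, expected: bytes, new: bytes) -> None:
        # Compare first; write only if the current contents match `expected`.
        current = self.blocks.get(lba, b"")
        if current != expected:
            raise AtsMiscompare(f"block {lba}: contents differ from expected image")
        self.blocks[lba] = new


def refresh_heartbeat(lun: ToyLun, lba: int, last_image: bytes, timestamp: int) -> bytes:
    # Liveness update done with ATS: the host supplies the image it last wrote
    # and the new image carrying the fresh time stamp.
    new_image = timestamp.to_bytes(8, "big")
    lun.atomic_test_and_set(lba, last_image, new_image)
    return new_image


initial = (100).to_bytes(8, "big")
lun = ToyLun(blocks={7: initial})

image = refresh_heartbeat(lun, 7, initial, 103)  # succeeds, block now holds 103

try:
    # The host presents a stale image (100) although the block holds 103.
    refresh_heartbeat(lun, 7, initial, 106)
except AtsMiscompare as err:
    print("ATS miscompare:", err)
```

In this toy model a miscompare happens whenever the host's idea of the block and the actual block contents disagree, which is the kind of condition the “ATS Miscompare” messages report.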
This week, IBM posted this Flash (Alert) ssg1S1005201 on the “ATS Miscompare” messages, stating the following:
Due to the low timeout value for heartbeat I/O using ATS, this can lead to host disconnects and application outages if delays of 8 seconds or longer are experienced in completing individual heartbeat I/Os on backend storage systems or the SAN infrastructure.
To clarify, VMware is not doing anything incorrect by using ATS for heartbeat liveness. What was introduced is a more aggressive use of ATS. In the IBM case, there is a recommendation to disable ATS for heartbeats and revert to the older mechanism of maintaining heartbeat liveness. This change does not require a reboot, FYI. It basically sets the behaviour back to the way it was before 5.5U2.
Issues arising from this more widespread use of ATS may not be limited to IBM storage arrays. But, having said that, the vast majority of customers running on 5.5U2 or later do not appear to be seeing any issues. Their arrays appear to be handling this change in ATS usage absolutely fine.
For anyone else who observes these messages on other storage arrays after upgrading to 5.5U2 (or vSphere 6.0), please don’t simply disable ATS heartbeats. I’d recommend that you have a conversation with your array vendor for guidance. They may already have a fix for the issue.
Our GSS folks are working on a knowledge base article which we hope to publish very soon.
[Update – April 21st, 2015] KB article 2113956 is now published.
A link to an article discussing ATS Miscompare on EMC VMAX.
A link to an article discussing ATS Miscompare on SolidFire.