ATS Miscompare revisited in vSphere 6.5
Does anyone remember the ATS Miscompare issue? This blog post from 2 years ago might jog your memory. It is basically an issue that arose when we began using ATS, the VAAI Atomic Test and Set primitive, for maintaining the ‘liveness’ of a heartbeat in vSphere 5.5U2. After making this change, a number of customers started to see “ATS Miscompare detected between test and set HB images” messages after upgrading to vSphere 5.5U2 or later. The HB reference in the message is shorthand for heartbeat.
In previous releases, we did not use ATS for maintaining the ‘liveness’ of a heartbeat. We used to only use ATS in the following cases:
- Acquire a heartbeat
- Clear a heartbeat
- Replay a heartbeat
- Reclaim a heartbeat
But it seems that the change to also use ATS to maintain the ‘liveness’ of heartbeats led to some issues.
What is ATS?
ATS is also called hardware assisted locking and was initially introduced as a replacement lock mechanism for SCSI reservations on VMFS volumes. This was done to improve performance and avoid lock contention when doing metadata updates. ATS is a standard T10 command and uses opcode 0x89 (COMPARE AND WRITE). Essentially, ATS is a mechanism for atomically modifying a single disk sector, which is far more graceful than the SCSI reservation approach of locking the whole LUN: ATS locks only the area on disk that we wish to update, rather than the whole disk. Historically, this was just to allow an ESXi host to do a metadata update on a VMFS. An example of when a metadata update would be needed is when allocating space to a VMDK during provisioning; certain characteristics need to be updated in the metadata to reflect the new size of the file. Even something as simple as updating the timestamp of when the file was last accessed needs a metadata update. In vSphere 5.5U2, the enhancement was to use ATS to lock the heartbeat region and update the time stamp on the heartbeat, letting other hosts know that this host is still alive and maintaining its lock(s).
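To make the compare-and-write idea concrete, here is a minimal sketch of the semantics in Python. It is purely illustrative: the SimulatedDisk class, the in-memory sectors, and the lock are stand-ins for what a real array guarantees internally, not VMware's or any vendor's implementation.

```python
import threading

SECTOR_SIZE = 512

class SimulatedDisk:
    """Toy in-memory 'disk' used only to illustrate COMPARE AND WRITE semantics."""

    def __init__(self, num_sectors):
        self._sectors = [bytes(SECTOR_SIZE) for _ in range(num_sectors)]
        self._lock = threading.Lock()  # stands in for the array's atomicity guarantee

    def compare_and_write(self, lba, test_image, set_image):
        """Atomically: if sector 'lba' matches test_image, replace it with set_image.

        Returns True on success, False on a miscompare (the equivalent of the
        array reporting an ATS miscompare back to the host).
        """
        assert len(test_image) == SECTOR_SIZE and len(set_image) == SECTOR_SIZE
        with self._lock:
            if self._sectors[lba] != test_image:
                return False  # sector changed underneath us: lock not acquired
            self._sectors[lba] = set_image
            return True

    def read(self, lba):
        """Return the current contents of sector 'lba'."""
        with self._lock:
            return self._sectors[lba]
```

If two hosts race to update the same lock sector with the same test image, only one compare_and_write() call succeeds and the other gets a miscompare, which is exactly the property that makes ATS usable as an on-disk lock.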
What is a heartbeat?
The heartbeat region of VMFS is used for on-disk locking. Every host that uses the VMFS volume has its own heartbeat region, and a host’s locks are maintained as long as its heartbeat region keeps being updated. The part of the heartbeat that gets updated is a time stamp, which tells other hosts that this host is alive and that its locks should not be broken. When the host is down, this region is used to communicate lock state to other hosts, allowing them to take ownership of files that were previously owned by the downed host, for example in the event of an HA failover.
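As a rough illustration of how liveness is inferred from a time stamp, here is a sketch using a made-up record layout and a made-up timeout; the real VMFS heartbeat format and timeout values differ.

```python
import struct
import time

HB_TIMEOUT_S = 16.0  # hypothetical liveness window; the real VMFS timeout differs


def pack_heartbeat(owner_id: bytes, timestamp: float) -> bytes:
    """Hypothetical heartbeat record: 16-byte owner ID followed by an 8-byte timestamp."""
    return owner_id.ljust(16, b"\x00") + struct.pack(">d", timestamp)


def heartbeat_is_stale(hb_record: bytes, now: float) -> bool:
    """Decide whether the heartbeat owner looks dead.

    If the timestamp has not advanced within the timeout window, the owner's
    locks become eligible to be broken (e.g. for an HA failover).
    """
    (timestamp,) = struct.unpack(">d", hb_record[16:24])
    return (now - timestamp) > HB_TIMEOUT_S


# A heartbeat last updated 30 seconds ago is considered stale in this sketch.
old_record = pack_heartbeat(b"host-a", time.time() - 30)
print(heartbeat_is_stale(old_record, time.time()))  # True
```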
What was the problem?
Well, there were a couple of issues. The first was the storage array misbehaving under certain circumstances. In some cases, we got ATS miscompares when the array was overloaded. In another scenario, it was due to reservations on the LUN on which the VMFS resided. In another case, the ATS “set” was correctly written to disk, but the array still returned a miscompare.
But it wasn’t all due to the arrays. We found situations where VMFS itself detected an ATS miscompare incorrectly. In this case, a heartbeat I/O (1) timed out and VMFS aborted it, but before the abort took effect, that I/O (the ATS “set”) had actually made it to the disk. VMFS then retried the ATS using the original “test” image from step (1), on the assumption that the aborted ATS never reached the disk. Because the “set” had in fact landed before the abort, the in-memory “test” image no longer matched the on-disk image, so the array returned “ATS miscompare”.
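The sequence is easier to see spelled out as a toy timeline, reusing the hypothetical SimulatedDisk sketch from earlier; old_hb and new_hb stand in for the heartbeat “test” and “set” images.

```python
def false_miscompare_timeline(disk, lba, old_hb, new_hb):
    """Toy reconstruction of the false-miscompare sequence described above."""
    # Step 1: the host issues ATS(test=old_hb, set=new_hb). The I/O appears to
    # time out, so VMFS aborts it -- but the write has already landed on disk.
    disk.compare_and_write(lba, test_image=old_hb, set_image=new_hb)  # result never seen

    # Step 2: believing the ATS never reached the disk, VMFS retries with the
    # ORIGINAL test image. The sector now holds new_hb, so the compare fails
    # and the array reports a miscompare even though nothing is actually wrong.
    ok = disk.compare_and_write(lba, test_image=old_hb, set_image=new_hb)
    assert not ok  # a "false" ATS miscompare


disk = SimulatedDisk(num_sectors=1)
false_miscompare_timeline(disk, lba=0, old_hb=disk.read(0), new_hb=b"\x01" * SECTOR_SIZE)
```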
When an ATS miscompare is received, all outstanding IO is aborted. This led to additional stress and load being placed on the storage arrays.
What have we done to improve it?
In vSphere 6.5, new heuristics have been added so that when we get an ATS miscompare event, VMFS reads the heartbeat data back from disk and compares it against the ATS “test” and “set” images to determine whether the miscompare is real. If it is, we do the same as before, which is to abort all the outstanding I/O. If the on-disk heartbeat data has not changed, then this is a false miscompare, and VMFS will not immediately abort I/Os. Instead, VMFS will re-attempt the ATS heartbeat operation after a short interval (usually less than 100ms).
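One way to picture that read-back heuristic, again only as a sketch on top of the earlier simulated primitives: the retry delay and the interpretation of “has not changed” as “still matches this host’s own test or set image” are my assumptions, not VMware’s code.

```python
import time

RETRY_DELAY_S = 0.1  # "usually less than 100ms" per the post; exact value assumed


def handle_ats_miscompare(disk, lba, test_image, set_image, abort_outstanding_io):
    """Sketch of the vSphere 6.5 behaviour described above."""
    on_disk = disk.read(lba)
    if on_disk in (test_image, set_image):
        # The heartbeat slot still holds this host's own data: a false miscompare.
        # Retry the ATS heartbeat after a short pause instead of aborting I/O.
        time.sleep(RETRY_DELAY_S)
        return disk.compare_and_write(lba, test_image=on_disk, set_image=set_image)
    # The slot genuinely changed underneath us: a real miscompare.
    abort_outstanding_io()
    return False
```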
This should hopefully address the ATS miscompare issues seen with previous versions of vSphere. To learn more about storage improvements in vSphere 6.5, check out this core storage white paper.
Great info! Do you know if the change is going to be backported to the 5.5 and 6.0 branches?
[Update – 29-Aug-17] I’ve just been told that the changes are not specific to VMFS-6, they are applicable for VMFS-5 too.
Also, the hardening changes have been backported to the vSphere 6.0 U3 release.
There are no plans yet to backport to vSphere 5.5.
Besides that, there was an awful bug where the QLogic cards interpreted the T10 command with a length of 1024 instead of the expected 512; this was fixed in qlnative >1.1.54.
Ugh! Thanks for sharing Andreas.
Interesting change to be aware of, given I’ve just implemented a 6.5U1 / VMFS6 / 3PAR environment with the latest firmware across the board and I’m seeing ATS mismatches in the vmkernel log constantly. I have a support case open with VMware and HP to find the cause.
Andy, what 3PAR OS version are you on? HPE released ATS improvements in their code some time back, and things have been very good for us since then.
I dread going through another round of ATS issues… 🙁
This is a brand new 3PAR with the latest OS 3.3.1 and using iSCSI.
HPE appeared to be aware of an issue, but so far have directed us to a default set of things to try. One of those, the iSCSI “delayedack” setting, isn’t in the best practices doc, so I need to try that and report back. I will update here when we get to the bottom of the issue.
Were you able to get to the bottom of this? We also see this on an HP array, as well as on another SAN. We narrowed it down to the combination of VMFS6 and ESX 6.5.
ESX 5.5U2 with VMFS5 and ESX 6.0U3a with VMFS5 work, and ESX 6.5 with VMFS5 does not have any issues with ATS miscompare prints in the logs.
It’s only when VMFS6 and ESX 6.5 come into the picture that the ATS miscompare prints appear, and more interestingly they appear when any virtual machine task has started, like create snapshot, power on, etc. The ATS heartbeat has been disabled on the ESX servers for a long time.
I meant to say:
ESX 5.5U2 with VMFS5 and ESX 6.0U3a with VMFS5 work, and ESX 6.5 with VMFS5 does NOT have any traces of ATS miscompare prints in the logs.
Maybe it’s an “Andy” thing (haha), but we just had to disable ATS on an environment of ours as well, as it was causing unrecoverable VM failures. The logs had something to the tune of 7,500 ATS miscompare errors over the course of 3 days.
We’re using a XIOTech SAN and had never had a problem with miscompares before migrating to 6.5U1 from 6.0. We have open cases with XIO and VMware. Now every time our hosts reboot we get miscompares and an “all paths down” error, and we have to manually rescan the datastore before everything is back to normal. This happens on every host reboot. So they added extra checks that are getting it wrong?
I’ve not heard about this issue, but I strongly recommend following up with our GSS organization to isolate the root cause.
Same problem here with ESXi 6.5 U1 – Dell MD3820i iSCSI SAN – VMFS6 Datastores