In previous releases, we did not use ATS for maintaining the ‘liveness’ of a heartbeat. We only used ATS in the following cases:
- Acquire a heartbeat
- Clear a heartbeat
- Replay a heartbeat
- Reclaim a heartbeat
But it seems that the change to also use ATS to maintain the ‘liveness’ of heartbeats led to some issues.
What is ATS?
ATS, also called hardware-assisted locking, was initially introduced as a replacement lock mechanism for SCSI reservations on VMFS volumes. This was done to improve performance and avoid lock contention when doing metadata updates. ATS is a standard T10 command and uses opcode 0x89 (COMPARE AND WRITE). Basically, ATS locks can be considered a mechanism for atomically modifying a disk sector. This is far more graceful than the SCSI reservation approach, which locked the whole LUN. Instead, ATS locks only the area on disk that we wish to update, rather than the whole disk. Historically, this was just to allow an ESXi host to do a metadata update on a VMFS. An example of when a metadata update would be needed is when allocating space to a VMDK during provisioning: certain characteristics need to be updated in the metadata to reflect the new size of the file. But even something as simple as updating the timestamp of when the file was last accessed needs a metadata update. In vSphere 5.5 U2, the enhancement was to use ATS to lock the heartbeat region and update the time stamp on the heartbeat to let other hosts know that this host is still alive and maintaining its lock(s).
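To make the COMPARE AND WRITE semantics concrete, here is a minimal sketch of what the array does for a single sector. All the names here are illustrative, not a real API; a real array performs this atomically in firmware.

```python
# Hypothetical sketch of SCSI COMPARE AND WRITE (opcode 0x89) semantics
# for a single 512-byte sector. Names are illustrative only.

SECTOR_SIZE = 512

class MiscompareError(Exception):
    """Raised when the 'test' image does not match the on-disk data."""

def compare_and_write(disk: bytearray, lba: int, test: bytes, set_: bytes) -> None:
    """Atomically: if the sector at `lba` equals the `test` image, replace it
    with the `set_` image. Otherwise fail with a miscompare and leave the
    sector untouched."""
    assert len(test) == SECTOR_SIZE and len(set_) == SECTOR_SIZE
    start = lba * SECTOR_SIZE
    current = bytes(disk[start:start + SECTOR_SIZE])
    if current != test:
        # The 'compare' half failed: someone else changed the sector.
        raise MiscompareError(f"LBA {lba}: on-disk data != test image")
    # The 'write' half: install the new image.
    disk[start:start + SECTOR_SIZE] = set_
```

A host only wins (or refreshes) a lock if its in-memory view of the sector was still current, which is exactly what makes this usable for fine-grained on-disk locking.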
What is a heartbeat?
The heartbeat region of VMFS is used for on-disk locking. Every host that uses the VMFS volume has its own heartbeat region, and a host’s locks are maintained by updating that region. What gets updated is a time stamp, which tells other hosts that this host is alive and that its locks should not be broken. When the host is down, this region is used to communicate lock state to other hosts. This allows other hosts to take ownership of files that were previously owned by the downed host, for example, in the event of an HA fail-over.
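The liveness scheme above can be sketched as a table of per-host time stamps. This is a simplified model under assumed names and an assumed timeout value, not the actual VMFS on-disk layout:

```python
# Illustrative sketch of heartbeat-based liveness. The slot layout and the
# timeout value are assumptions for the example, not VMFS internals.

HEARTBEAT_TIMEOUT = 16.0  # seconds a stamp may go stale before locks can be broken

class HeartbeatRegion:
    def __init__(self):
        self.slots = {}  # host_id -> last-updated time stamp

    def update(self, host_id: str, now: float) -> None:
        """A live host stamps its own slot to maintain its locks."""
        self.slots[host_id] = now

    def is_alive(self, host_id: str, now: float) -> bool:
        """Other hosts check the stamp; a stale slot means the host is
        presumed down and its locks may be reclaimed."""
        last = self.slots.get(host_id)
        return last is not None and (now - last) < HEARTBEAT_TIMEOUT
```

In VMFS the `update` step is the ATS heartbeat write discussed above, so the stamp is only refreshed if the host’s previous view of its slot was still current.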
What was the problem?
Well, there were several issues. The first was the storage array misbehaving under certain circumstances. In some cases, we got ATS miscompares when the array was overloaded. In another scenario, the cause was reservations on the LUN on which the VMFS resided. And in another case, the ATS “set” was correctly written to disk, but the array still returned a miscompare.
But it wasn’t all due to the arrays. We found situations where VMFS itself detected an ATS miscompare incorrectly. In this case, a heartbeat I/O (the ATS “set”) timed out and VMFS aborted it, but before the abort took effect, the write actually made it to the disk. VMFS then retried the ATS using the original “test” image, on the assumption that the aborted ATS never reached the disk. Since the “set” had in fact landed before the abort, the in-memory “test” image no longer matched the on-disk data, so the array returned “ATS miscompare”.
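The race above can be walked through in a few lines. This is a toy reconstruction with made-up heartbeat values, just to show why the retry is doomed once the aborted write has landed:

```python
# Toy reconstruction of the false-miscompare race (hypothetical values).
# The retry reuses the ORIGINAL test image, but the "aborted" write already
# landed, so the sector now holds the SET image instead.

def retried_ats_result(on_disk: bytes, test: bytes, set_: bytes) -> str:
    """What the array answers to the retried ATS heartbeat."""
    if on_disk == test:
        return "success"
    return "miscompare"  # even if on_disk == set_, i.e. our own write succeeded

old_hb = b"t=100"
new_hb = b"t=101"

# 1. Host issues ATS(test=old_hb, set=new_hb). The I/O times out and is
#    aborted, but the write actually reached the disk first:
on_disk = new_hb

# 2. Host assumes the write never happened and retries with the stale test image:
result = retried_ats_result(on_disk, test=old_hb, set_=new_hb)
# result == "miscompare" -- a false miscompare: the disk holds exactly what we wrote.
```

The key point is that from the host’s side a timed-out, aborted I/O is indistinguishable from one that never reached the media, while from the array’s side the compare simply failed.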
When an ATS miscompare is received, all outstanding I/O is aborted. This led to additional stress and load being placed on the storage arrays.
What have we done to improve it?
In vSphere 6.5, new heuristics were added so that when we get an ATS miscompare event, VMFS reads the on-disk heartbeat data and checks it against the ATS “test” and “set” images to determine whether there is actually a real miscompare. If the miscompare is real, then we do the same as before, which is to abort all the outstanding I/O. If the on-disk heartbeat data has not changed, then this is a false miscompare. In the event of a false miscompare, VMFS will not immediately abort I/Os. Instead, VMFS will re-attempt the ATS heartbeat operation after a short interval (usually less than 100 ms).
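The decision logic can be sketched as a small classifier. This is a hedged interpretation of the heuristic, with assumed names and return values; in particular, the exact comparison VMFS performs against the “test” and “set” images is not spelled out here, so both the “disk still holds our test image” and “disk already holds our set image” cases are treated as false miscompares worth retrying:

```python
# Sketch of the vSphere 6.5 miscompare heuristic (names and return values
# are assumptions for illustration).

def classify_miscompare(on_disk: bytes, test: bytes, set_: bytes) -> str:
    """Decide how to react after the array reports an ATS miscompare."""
    if on_disk == set_:
        # Our earlier "set" actually landed (the aborted-I/O race):
        # false miscompare, just retry the heartbeat shortly.
        return "retry_heartbeat"
    if on_disk == test:
        # Disk still matches our view; the ATS can simply be reissued:
        # also a false miscompare.
        return "retry_heartbeat"
    # The heartbeat genuinely changed underneath us: real miscompare,
    # so abort all outstanding I/O as before.
    return "abort_outstanding_io"
```

Only the last branch triggers the expensive abort-everything path, which is what removes the extra load the arrays were seeing.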
This should hopefully address the ATS miscompare issues seen with previous versions of vSphere. To learn more about storage improvements in vSphere 6.5, check out this core storage white paper.