CormacHogan.com

VSAN 6.1 New Feature – Handling of Problematic Disks

The more observant of you may have noticed the following entry in the VSAN 6.1 Release Notes: "Virtual SAN monitors solid state drive and magnetic disk drive health and proactively isolates unhealthy devices by unmounting them. It detects gradual failure of a Virtual SAN disk and isolates the device before congestion builds up within the affected host and the entire Virtual SAN cluster. An alarm is generated from each host whenever an unhealthy device is detected, and an event is generated if an unhealthy device is automatically unmounted." The purpose of this post is to provide you with a little more information about this cool new feature.

History

Let’s give a little background first of all. We have had some escalations in the past where either a solid state drive or a magnetic disk drive was misbehaving. In one case, a drive was constantly reporting errors but had not actually failed. The net result was that this single disk drive was introducing poor performance to the cluster overall. The objective of this new feature is to have a mechanism that monitors for these misbehaving drives and isolates them so that they do not impact the overall cluster. What VSAN looks for is a sustained period of high latency on the SSD or the magnetic disk drives. If such a sustained period of high latency is observed, VSAN will unmount either the disk (in the case of a capacity device) or the disk group on which the disk resides (in the case of a cache device). The components on the disk/disk group will be marked as “Absent”, and after the clomd timer has expired, the components will be rebuilt elsewhere in the cluster. What this means is that the performance of the virtual machines remains consistent, and will not be impacted by this one misbehaving drive.
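The behaviour described above can be sketched in a few lines of code. This is a minimal illustration only, not VSAN's actual implementation: the class name, the per-interval counting scheme, and the interval count before unmount are all assumptions made for the sake of the example; the real logic lives inside the LSOM layer on each host.

```python
LATENCY_THRESHOLD_MS = 50    # the trigger threshold discussed below
MAX_EXCEEDED_INTERVALS = 4   # assumed number of sustained intervals before unmount

class DeviceMonitor:
    """Toy model of the VSAN 6.1 unhealthy-device handling: track average
    latency per monitoring interval and unmount the disk (capacity tier)
    or the whole disk group (cache tier) after *sustained* high latency."""

    def __init__(self, device_id, is_cache_device):
        self.device_id = device_id
        self.is_cache_device = is_cache_device
        self.exceeded = 0            # consecutive intervals over threshold
        self.unmounted_scope = None  # None, "disk", or "diskgroup"

    def record_interval(self, avg_latency_ms):
        if self.unmounted_scope:
            return self.unmounted_scope
        if avg_latency_ms > LATENCY_THRESHOLD_MS:
            self.exceeded += 1
            print(f"WARNING - latency on {self.device_id} has exceeded "
                  f"threshold value {LATENCY_THRESHOLD_MS} ms "
                  f"{self.exceeded} times")
            if self.exceeded >= MAX_EXCEEDED_INTERVALS:
                # A failing cache device takes the whole disk group offline;
                # a failing capacity device is unmounted on its own.
                self.unmounted_scope = ("diskgroup" if self.is_cache_device
                                        else "disk")
        else:
            self.exceeded = 0  # latency must be sustained, not a one-off spike
        return self.unmounted_scope
```

Note that a single spike resets the counter: it is the sustained period of high latency, not an isolated slow I/O, that triggers the unmount.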

Detection

I guess the main question from administrators/operators is: how can I tell that VSAN has performed such an operation? Well, there are a number of different VOBs (VMkernel Observations) that are raised if this happens. For example, when reads from a disk exceed the read latency threshold that will lead to unmounting a VSAN SSD or MD, either of two messages could be generated.

When writes exceed the write latency threshold that will lead to unmounting a VSAN cache device or capacity device, either of two messages could be generated.

Again, to reiterate, if it is the cache device, the whole of the disk group is impacted.

Exceeding the read latency warning threshold generates a warning VOB message.

Exceeding the write latency warning threshold likewise generates a warning VOB message.

Trigger/Threshold

The trigger threshold is 50 ms. Here is an example of the event occurring on an actual host with a below-spec SSD:

2015-09-15T02:21:27.270Z cpu8:89341)VSAN Device Monitor: WARNING – READ Average Latency on VSAN device naa.6842b2b006600b001a6b7e5a0582e09a has exceeded threshold value 50 ms 1 times.
2015-09-15T02:21:27.570Z cpu5:89352)VSAN Device Monitor: Unmounting VSAN diskgroup naa.6842b2b006600b001a6b7e5a0582e09a

Caution for home lab users

VSAN is an enterprise-class HCI system. We support customers running tier 1 workloads on VSAN. We have an extensive HCL which covers devices, controllers, drivers and firmware. Any correctly behaving device selected from the HCL should never encounter these latency maximums; only misbehaving devices should show this sort of behaviour.

However, VMware realizes that many of our customers, partners and indeed employees run their own home labs with devices that are not on the HCL. Since this feature is enabled out-of-the-box, a word of caution to you folks: consumer-grade devices used in home labs, if pushed to their limits, may experience high latency values, which might lead to this feature in VSAN 6.1 unmounting your disks or disk groups. To avoid this situation, there are two advanced parameters, described in KB 2132079, that will prevent the disk group from being unmounted.
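A sketch of how these would be set from the ESXi shell, assuming the /LSOM/VSANDeviceMonitoring and /LSOM/lsomSlowDeviceUnmount options as I understand them from KB 2132079; verify the exact option names against the KB for your build before applying anything:

```shell
# Run on each ESXi host in the cluster (these settings are per-host).

# Option 1 (believed per KB 2132079): disable VSAN device monitoring
# entirely, so device latency is no longer tracked.
esxcfg-advcfg -s 0 /LSOM/VSANDeviceMonitoring

# -or-

# Option 2 (believed per KB 2132079): keep monitoring (and its warning
# VOBs), but prevent the automatic unmount of slow devices.
esxcfg-advcfg -s 0 /LSOM/lsomSlowDeviceUnmount

# Check the current value of an option:
esxcfg-advcfg -g /LSOM/lsomSlowDeviceUnmount
```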

It might be a good idea to disable this feature on home labs straight away, before starting any virtual machines or any maintenance-mode operations where data is being migrated between hosts. This is also true for readers planning to upgrade their home labs to VSAN 6.1 (vSphere 6.0u1).

More information can be found in the official KB article 2132079.

Conclusion

This is a great enhancement to the VSAN feature set. I think this will take care of those devices which might be on the verge of failing, but have not failed yet. The problem with those devices is that they impact everything else. This feature will mitigate that sort of issue in your VSAN cluster.
