The more observant among you may have noticed the following entry in the VSAN 6.1 Release Notes: "Virtual SAN monitors solid state drive and magnetic disk drive health and proactively isolates unhealthy devices by unmounting them. It detects gradual failure of a Virtual SAN disk and isolates the device before congestion builds up within the affected host and the entire Virtual SAN cluster. An alarm is generated from each host whenever an unhealthy device is detected and an event is generated if an unhealthy device is automatically unmounted." The purpose of this post is to provide a little more information about this cool new feature.
History
Detection
For exceeding a read latency threshold that will lead to unmounting a VSAN cache device or capacity device, either of the following two messages could be generated:
- WARNING – READ Average latency on VSAN device %s is %d ms an higher than threshold value %d ms.
- WARNING – READ Average Latency on VSAN device %s has exceeded threshold value %d ms %d times.
For exceeding a write latency threshold that will lead to unmounting a VSAN cache device or capacity device, either of the following two messages could be generated:
- WARNING – WRITE Average latency on VSAN device %s is %d ms an higher than threshold value %d ms.
- WARNING – WRITE Average Latency on VSAN device %s has exceeded threshold value %d ms %d times.
To reiterate, if the affected device is the cache device, the whole disk group is impacted.
For exceeding a read latency warning threshold that will generate a warning VOB message, the following message will be generated:
- WARNING – Half-life READ Average Latency on VSAN device %s is %d and is higher than threshold value %d ms.
For exceeding a write latency warning threshold that will generate a warning VOB message, the following message will be generated:
- WARNING – Half-life WRITE Average Latency on VSAN device %s is %d and is higher than threshold value %d ms.
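If you want to check whether a host has already logged any of these messages, a minimal sketch along the following lines may help. The function names are my own, and the log file is passed in as an argument because this post does not state which file the messages are written to. The match is case-insensitive because the message templates capitalise "latency" inconsistently.

```shell
# Print every VSAN device latency message found in the log file given as $1.
# -i is used because the templates mix "latency" and "Latency".
vsan_latency_messages() {
    grep -i 'Average Latency on VSAN device' "$1"
}

# Count latency messages per device in the log file given as $1.
# The device name is the word that follows "VSAN device" in each message.
vsan_latency_counts() {
    vsan_latency_messages "$1" |
        sed 's/.*VSAN device \([^ ]*\) .*/\1/' |
        sort | uniq -c
}
```

Calling `vsan_latency_counts <logfile>` prints one line per device, with the number of latency messages in front of the device name.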
Trigger/Threshold
The trigger threshold is 50 ms. Here is an example of the event occurring on an actual host with a below-spec SSD:
2015-09-15T02:21:27.270Z cpu8:89341)VSAN Device Monitor: WARNING – READ Average Latency on VSAN device naa.6842b2b006600b001a6b7e5a0582e09a has exceeded threshold value 50 ms 1 times.
2015-09-15T02:21:27.570Z cpu5:89352)VSAN Device Monitor: Unmounting VSAN diskgroup naa.6842b2b006600b001a6b7e5a0582e09a
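To find out which disk groups have been unmounted by the monitor, you could search for the "Unmounting VSAN diskgroup" entry shown in the example above. This is only a sketch: the function name is hypothetical, the log file is passed as an argument rather than assumed, and it relies on the device name being the last word on the line, as it is in the example.

```shell
# List the unique devices named in "Unmounting VSAN diskgroup" entries
# in the log file given as $1.
vsan_unmounted_diskgroups() {
    grep 'VSAN Device Monitor: Unmounting VSAN diskgroup' "$1" |
        awk '{ print $NF }' |   # device name is the last field on the line
        sort -u
}
```

An empty result simply means no such entry was found in that file.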
Caution for home lab users
VMware realizes that many of our customers, partners and indeed employees run their own home labs with devices that are not on the HCL. Since this feature is enabled out of the box, a word of caution to you folks: the consumer-grade devices used in your home labs, if pushed to their limits, may experience high latency values, which might lead to this feature in VSAN 6.1 unmounting your disk groups. To avoid this situation, either of the following two advanced parameters will prevent the disk group from unmounting:
- Disable VSAN Device Monitoring (and subsequent unmounting of the disk group):
# esxcli system settings advanced set -o /LSOM/VSANDeviceMonitoring -i 0 <-- default is "1"
-or-
- Disable VSAN Slow Device Unmounting (monitoring continues):
# esxcli system settings advanced set -o /LSOM/lsomSlowDeviceUnmount -i 0 <-- default is "1"
It might be a good idea to turn off these features on home labs immediately, before starting any virtual machines or performing maintenance mode operations where data is migrated between hosts. The same applies to readers planning to upgrade their home labs to VSAN 6.1 (vSphere 6.0u1).
More information can be found in the official KB article 2132079.
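Before and after changing either parameter, it can be useful to check what the host currently has configured. The sketch below wraps the `esxcli system settings advanced list` command for the two options named above and adds a helper that sets the slow device unmount option back to its default of 1. The function names are my own, and the sketch assumes it is run in an ESXi shell where `esxcli` is on the PATH.

```shell
# Show the current settings of the two advanced options discussed above.
show_vsan_monitor_settings() {
    for opt in /LSOM/VSANDeviceMonitoring /LSOM/lsomSlowDeviceUnmount; do
        esxcli system settings advanced list -o "$opt"
    done
}

# Set slow device unmounting back to its default value of 1.
enable_vsan_slow_device_unmount() {
    esxcli system settings advanced set -o /LSOM/lsomSlowDeviceUnmount -i 1
}
```

Wrapping the commands in functions keeps the option paths in one place, which helps if you need to repeat the check on every host in the cluster.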
Conclusion
This is a great enhancement to the VSAN feature set. I think this will take care of those devices which might be on the verge of failing, but have not failed yet. The problem with those devices is that they impact everything else. This feature will mitigate that sort of issue in your VSAN cluster.