VSAN 6.1 New Feature – Handling of Problematic Disks

The more observant of you may have noticed the following entry in the VSAN 6.1 Release Notes: "Virtual SAN monitors solid state drive and magnetic disk drive health and proactively isolates unhealthy devices by unmounting them. It detects gradual failure of a Virtual SAN disk and isolates the device before congestion builds up within the affected host and the entire Virtual SAN cluster. An alarm is generated from each host whenever an unhealthy device is detected, and an event is generated if an unhealthy device is automatically unmounted." The purpose of this post is to give you a little more information about this cool new feature.

History

Let’s give a little background first of all. We have had some escalations in the past where either a solid state drive or a magnetic disk drive was misbehaving. In one case, the drive was constantly reporting errors but had not actually failed. The net result was that this single disk drive was degrading the performance of the cluster overall. The objective of this new feature is to have a mechanism that monitors for these misbehaving drives and isolates them so that they do not impact the overall cluster. What we look for is a sustained period of high latency on the SSD or the magnetic disk drives. If such a sustained period of high latency is observed, VSAN will unmount either the disk (in the case of a capacity device) or the disk group on which the disk resides (in the case of a cache device). The components on the disk/disk group will be marked as “Absent”, and after the clomd timer has expired, the components will be rebuilt elsewhere in the cluster. What this means is that the performance of the virtual machines remains consistent, and will not be impacted by this one misbehaving drive.
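As a side note, the clomd timer mentioned above is controlled by a host advanced setting. A quick sketch of how to check it from the ESXi shell (the default repair delay is 60 minutes):

```shell
# Show the clomd repair delay, i.e. how long components stay "Absent"
# before VSAN rebuilds them elsewhere in the cluster (default: 60 minutes).
esxcli system settings advanced list -o /VSAN/ClomRepairDelay
```

Like all advanced settings, this is per-host, so you would check it on each host in the cluster.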

Detection

I guess the main question from administrators/operators is: how can I tell that VSAN has performed such an operation? Well, a number of different VOBs (VMkernel Observations) are raised if this happens. For example, if reads from a disk exceed the read latency threshold that leads to unmounting a VSAN SSD or magnetic disk, either of the following two messages could be generated:

  • WARNING – READ Average Latency on VSAN device %s is %d ms and is higher than threshold value %d ms.
  • WARNING – READ Average Latency on VSAN device %s has exceeded threshold value %d ms %d times.

For exceeding a write latency threshold that will lead to unmounting a VSAN cache or capacity device, either of the following two messages could be generated:

  • WARNING – WRITE Average Latency on VSAN device %s is %d ms and is higher than threshold value %d ms.
  • WARNING – WRITE Average Latency on VSAN device %s has exceeded threshold value %d ms %d times.

Again, to reiterate, if it is the cache device, the whole of the disk group is impacted.

For exceeding a read latency warning threshold that will generate a warning VOB message, the following message will be generated.

  • WARNING – Half-life READ Average Latency on VSAN device %s is %d and is higher than threshold value %d ms.

For exceeding a write latency warning threshold that will generate a warning VOB message, the following message will be generated.

  • WARNING – Half-life WRITE Average Latency on VSAN device %s is %d and is higher than threshold value %d ms.

Trigger/Threshold

The trigger threshold is 50ms. Here is an example of the event occurring on an actual host with a below-spec SSD:

2015-09-15T02:21:27.270Z cpu8:89341)VSAN Device Monitor: WARNING – READ Average Latency on VSAN device naa.6842b2b006600b001a6b7e5a0582e09a has exceeded threshold value 50 ms 1 times.
2015-09-15T02:21:27.570Z cpu5:89352)VSAN Device Monitor: Unmounting VSAN diskgroup naa.6842b2b006600b001a6b7e5a0582e09a
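If you want to scan a host for these events, a small grep/sed sketch like the one below will pull out the affected device and the threshold that was exceeded. It is shown here against the sample log line above; on a live host you would run it against /var/log/vmkernel.log instead.

```shell
# Sample VSAN Device Monitor log line (taken from the host output above);
# on a live ESXi host you would grep /var/log/vmkernel.log instead.
logline='2015-09-15T02:21:27.270Z cpu8:89341)VSAN Device Monitor: WARNING - READ Average Latency on VSAN device naa.6842b2b006600b001a6b7e5a0582e09a has exceeded threshold value 50 ms 1 times.'

# Extract the device name and the latency threshold from the message.
device=$(echo "$logline" | sed -n 's/.*VSAN device \([^ ]*\) has.*/\1/p')
threshold=$(echo "$logline" | sed -n 's/.*threshold value \([0-9]*\) ms.*/\1/p')

echo "device=$device threshold=${threshold}ms"
```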

Caution for home lab users

VSAN is an enterprise-class HCI system. We support customers running tier-1 workloads on VSAN. We have an extensive HCL which covers devices, controllers, drivers and firmware. Any correctly behaving device selected from the HCL should never encounter these latency maximums; only misbehaving devices should show this sort of behaviour.

However, VMware realizes that many of our customers, partners and indeed employees run their own home labs with devices that are not on the HCL. Since this feature is enabled out of the box, a word of caution to you folks: consumer-grade devices used in home labs, if pushed to their limits, may experience high latency values, which might lead to this feature in VSAN 6.1 unmounting your disk groups. To avoid this situation, there are two advanced parameters that will prevent the disk group from being unmounted:

  • Disable VSAN Device Monitoring (and subsequent unmounting of diskgroup):
    # esxcli system settings advanced set -o /LSOM/VSANDeviceMonitoring -i 0    <-- default is "1"

-or-

  • Disable VSAN Slow Device Unmounting (continues monitoring):
    # esxcli system settings advanced set -o /LSOM/lsomSlowDeviceUnmount -i 0   <-- default is "1"

It might be a good idea to turn off these features on home labs immediately, before starting any virtual machines or any maintenance-mode type operations where data is migrated between hosts. This is also true for readers planning to upgrade their home labs to VSAN 6.1 (vSphere 6.0 U1).
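Since these are host-local advanced settings, any change has to be made on every host in the cluster. A quick sketch of confirming the current values first (1 means the feature is enabled, which is the default):

```shell
# Check whether device monitoring and slow-device unmounting
# are currently enabled on this host (default for both is 1).
esxcli system settings advanced list -o /LSOM/VSANDeviceMonitoring
esxcli system settings advanced list -o /LSOM/lsomSlowDeviceUnmount
```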

More information can be found in the official KB article 2132079.

Conclusion

This is a great enhancement to the VSAN feature set. I think it will take care of those devices which might be on the verge of failing but have not failed yet. The problem with such devices is that they impact everything else. This feature will mitigate that sort of issue in your VSAN cluster.

15 comments
  1. This feature sounds really great from a performance-preserving perspective. Can you weigh in on how this might impact my ability to get $vendor to replace my going-to-fail-but-haven’t-failed-yet drives under warranty/service contract?

    Is VMware going to try and work with vendors to get buy-in on replacing disks when VSAN decides to fail them?

    • Can’t really help here I’m afraid. I suspect this is a conversation that needs to be had with the vendor in question.

      I’m not aware of any conversations going on regarding disk replacement guidelines – all I can say is that this feature will hopefully avoid any dodgy disks impacting the whole VSAN cluster going forward.

      • I did expect an answer like this. I would just like to suggest that VMware consider features like this holistically.

        For a customer that runs a 1000+ disk VSAN, replacing disks at cost as opposed to relying on the warranty/service contract could be a huge barrier to adopting this feature.

        Which isn’t great, especially when one of the big advantages of VSAN is the lower capital cost.

        With technologies such as EVO:{Rail,Rack}, VMware has shown interest in partnering with hardware vendors. It should consider leveraging these relationships to get vendors onboard with features like this.

        Great work! I look forward to more “smart” features to come out of the VSAN product.

  2. Hi.

    Interesting, but what about a possible cascade effect if the VSAN has been sized wrong and there is high latency due to overuse? Will a drive be unmounted, then the load move to others and impact them?

    Phil @the_vmonkey

    • That would be the case I reckon. If the cluster has been undersized, but you want to continue running with the high latency, I’d recommend disabling the feature. You can remount the disk groups from the CLI in a single command if they were inadvertently unmounted.

      • Maybe. Should the mechanism look at the rest of the cluster to see if a drive’s latency is abnormally higher than the others, and factor in whether it is participating in a rebuild operation?

        Might not be an issue now, but more likely in the future if monitoring and capacity management are not too good.

  3. Maybe I’m missing something, but why is the entire disk group unmounted instead of just the misbehaving disk? Unless the caching SSD is the misbehaving part – when a capacity-tier disk is faulty, unmounting the whole disk group sounds like overkill, as there is (besides optional striping) no RAID at the disk/disk-group level, and it causes VSAN to rebuild more VM components than strictly necessary.

    As it stands now I would disable this problematic-disk-handling feature; in my opinion it is too dangerous to use. The reason is that if two hosts happen to have a misbehaving disk at the same time and VSAN unmounts both disk groups, then in the worst-case scenario a VM that is unaffected by the problematic disks could suffer unplanned downtime and possibly data loss, because it is unlucky enough to have both mirrors in the unmounted disk groups.

  4. Thanks for this article. I wish I had read it BEFORE spending 48 straight hours on the phone with VMware Support.

    The net effect of this feature (which we will call a bug) is that VSAN will automatically shut off whole disk groups based on whatever they think the best performance variables should be for VSAN.

    This is dangerous because if you have a host failure, a simple resync operation will cause all of your hosts to generate high disk I/O simultaneously. We had roughly 50% of our hosts fail due to lsomSlowDeviceUnmount deciding that it was time to shut off the disk group.

    I highly recommend that you turn this feature off until you do some serious load testing. Or you can learn the hard way like we did and take most of your VMs offline within minutes.

    In our case, the 6.0.U1 update was a requirement by VMware to fix a severe network bug they had in the 6.0b release that would randomly lock the hosts up. They passed this feature/bug into this update without alerting anyone.

    Though I appreciate the perseverance of VMware’s support engineers, this brain damage could have been avoided with better communications.

    • Hi

      That was what I thought might happen – see the post above: “philjusthost on September 22, 2015 at 2:58 pm said:”

      The solution would be for VSAN to take in the picture as a whole – if the whole disk group is busy and is part of a rebuild of a stripe, then conclude that it must be busy due to that.

      Phil
      @The_vMonkey

  5. Hi Cormac,

    I have to say I concur with the comments – this is not a good situation.

    VMware is clearly trying to do the right thing, but it is exposing the downsides of a software-defined HCI solution.

    I am a great believer in the fact that there are always trade-offs with all technology architectures – “what it gives with one hand it takes with the other”.

    VSAN when compared to a shared storage platform (SAN or NAS) has some advantages (i.e. simplicity and bring your own hardware), but with these strengths comes downsides.

    1. If VMware is going to predict drive failures, then this needs to be supported by the hardware vendor so the drive is swapped out under warranty – as it would be with a storage array. This is one of the downsides of bring-your-own-hardware.

    2. I have discussed this many times before, but I do not see how VSAN can be considered to have enterprise-class availability (and therefore be enterprise class) until the failure of a single SSD no longer takes down the entire disk group and you can have double disk protection with good usable capacity.

    The sooner we get Erasure Coding and the ability to have more than one SSD per disk group the better.

    It is a long journey to truly making it enterprise class – just saying it is does not make it so.

    Best regards
    Mark

  6. I wish I’d read this post before I spent my whole weekend fixing my VSAN cluster. After Update 1 my whole VSAN failed (with vCenter on it) due to two hosts which had unmounted disk groups.

    I wish I’d been warned before the whole disk group was unmounted. Or maybe I just need to read the update notices more carefully.

    For now I’m back at 6.0 without Update 1. Maybe I’ll try Update 1 again when I have enough spare time!

  7. Cormac, thanks for another great article as always. Home lab only – besides turning off the settings above, how do you turn off all VSAN alerts, especially the ones about the HCL and controllers?
    Thanks
    Tom

  8. I found this problem is affecting VSAN in Version 5.5 as well.

    By default I run the following scripts on all hosts BEFORE adding them to any VSAN cluster.

    esxcli system settings advanced set -o /LSOM/lsomSlowDeviceUnmount --int-value 0

    esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 180

Comments are closed.