VSAN 6.2 Part 10 – Problematic Disk Handling

Cormac

8 years ago

In this post, I want to talk about a feature called Problematic Disk Handling. Some history behind why we have such a feature can be found in this post. In VSAN 6.2/vSphere 6.0 U2, Problematic Disk Handling has been improved so that it will unmount a problematic disk/diskgroup for two reasons:

If VSAN detects excessive write IO latency to a capacity tier disk, the disk will be unmounted. By default, VSAN 6.2 no longer unmounts any disk/diskgroup due to excessive read IO latency, nor does it unmount a caching tier SSD due to excessive write IO latency. It only unmounts a disk that is functioning as a capacity tier disk and has excessive write IO latency. To unmount a disk, VSAN 6.2 must have detected write IO latency with a statistical mode above 500ms in four non-consecutive ten-minute intervals that are randomly distributed across a 400 minute (6-7 hour) time period. If you do want a disk group to be unmounted due to excessive write latency on the cache tier, the following advanced setting must be set:

        esxcfg-advcfg --set 1 /LSOM/lsomSlowTier1DeviceUnmount

If VSAN detects a disk that was previously failed by LSOM (Local Log Structured Object Manager, the internal part of VSAN that works at the physical disk layer), the disk will be unmounted. The failure could be due to an IO error that is neither transient nor recoverable, or to a 60 second aggregate IO timeout. The disk/diskgroup is unmounted, but VSAN attempts a re-mount immediately afterward, in the hope that marking the disk as failed was premature and that whatever caused the failure was transient and has since recovered. However, if the failure is persistent, VSAN will not be able to remount the disk/diskgroup. Many instances of LSOM failed disks are due to one or more lost I/Os involving a flaky driver/firmware/adapter.
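To make the write-latency rule in the first case more concrete, here is a minimal Python sketch of one possible reading of it. This is purely illustrative and is not how VSAN implements the check internally. The function name `should_unmount`, the idea of one representative latency value per ten-minute interval, and the treatment of "non-consecutive" as "not directly adjacent" are all assumptions made for the example.

```python
# Illustrative sketch only: one reading of the rule described above,
# not VSAN's actual implementation. All names here are hypothetical.

THRESHOLD_MS = 500       # latency threshold quoted in the text
REQUIRED_INTERVALS = 4   # number of non-consecutive intervals quoted in the text
WINDOW_INTERVALS = 40    # 400 minutes split into ten-minute intervals


def should_unmount(interval_latencies_ms):
    """Return True if at least four non-adjacent ten-minute intervals
    within the most recent 400 minutes exceeded the threshold.

    interval_latencies_ms: one representative write latency (in ms) per
    ten-minute interval, oldest first. (Assumed input shape.)
    """
    window = interval_latencies_ms[-WINDOW_INTERVALS:]
    count = 0
    last_hit = None
    for i, latency in enumerate(window):
        over = latency > THRESHOLD_MS
        # Only count an interval if it is not directly after the last counted one.
        if over and (last_hit is None or i - last_hit > 1):
            count += 1
            last_hit = i
    return count >= REQUIRED_INTERVALS
```

Under this reading, four slow intervals spread across the window would trigger an unmount, whereas a single burst of four back-to-back slow intervals would not, since adjacent intervals are not counted separately.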

Troubleshooting

In general, there are a few things to look for to figure out if problematic disk handling has kicked in:

A vmkernel.log message indicating that a disk has errored in LSOM and the disk/diskgroup has been unmounted:

2016-02-10T10:10:51.481Z cpu6:43298)WARNING: LSOM: LSOMEventNotify:6440: 
Virtual SAN device 52db4996-ffdd-9957-485c-e2dcf1057f66 is under 
permanent error.
.
.
2016-02-10T10:17:53.238Z cpu14:3443764)VSAN Device Monitor: Successfully
unmounted failed VSAN diskgroup naa.600508b1001cbbbe903bd48c8f6b2ddb

A hostd log message indicating that a disk/diskgroup has been unmounted:

 event.Unmounting failed VSAN diskgroup

A vCenter Server event message indicating that a disk/diskgroup has been unmounted, which should be visible via the vCenter client’s monitor events panel for the particular VSAN host:

eventTypeId = "Re-mounting failed VSAN diskgroup 
naa.600508b1001cbbbe903bd48c8f6b2ddb.",

A reference to “failed” as opposed to “unhealthy” in the vmkernel.log message indicates that LSOM detected that the disk failed (the second scenario from the list above). The reference to “diskgroup” in the same log message indicates that the entire diskgroup is being unmounted, as opposed to a single capacity tier disk. Note that this will be the case when (a) the disk that failed in LSOM is the cache device of the disk group, or (b) the diskgroup is on an all-flash VSAN where deduplication has been enabled (so a disk failure impacts the whole of the disk group).
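If you have a copy of vmkernel.log to hand, the checks above can be roughly automated. The following Python sketch scans log lines for the two message types shown earlier and reports whether each unmount mentions “failed” or “unhealthy”, and a diskgroup or a single disk. The function name `scan_vmkernel_log` is hypothetical. The patterns are based only on the sample messages in this post, with each log entry assumed to sit on a single line. The exact wording of the “unhealthy” and single-disk variants is also an assumption.

```python
import re

# Patterns derived from the sample messages in this post. The "unhealthy"
# and single "disk" variants are assumed to follow the same wording.
PERMANENT_ERROR = re.compile(
    r"Virtual SAN device (?P<uuid>[0-9a-f-]+) is under permanent error"
)
UNMOUNTED = re.compile(
    r"VSAN Device Monitor: Successfully unmounted "
    r"(?P<state>failed|unhealthy) VSAN (?P<scope>diskgroup|disk) (?P<device>\S+)"
)


def scan_vmkernel_log(lines):
    """Return a list of dicts describing problematic-disk-handling events.

    lines: an iterable of log lines, one log entry per line (assumed).
    """
    events = []
    for line in lines:
        match = PERMANENT_ERROR.search(line)
        if match:
            events.append({"type": "permanent_error", "uuid": match.group("uuid")})
            continue
        match = UNMOUNTED.search(line)
        if match:
            events.append({
                "type": "unmounted",
                "state": match.group("state"),    # "failed" or "unhealthy"
                "scope": match.group("scope"),    # "diskgroup" or "disk"
                "device": match.group("device"),
            })
    return events
```

A typical use would be something like `scan_vmkernel_log(open("vmkernel.log"))`. An event with `state` set to `failed` would then correspond to the LSOM-detected failure case, and `scope` set to `diskgroup` to a whole disk group being unmounted.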