Degraded Device Handling (DDH) Revisited
Degraded Device Handling (DDH), or Dying Disk Handling as it was formerly known, is a feature that has been available in vSAN for some time. However, I regularly get questions about how it works, and the DDH behavior has changed significantly across versions. Let's begin this post with an overview of the purpose of DDH and then get into the different behaviors.
First of all, the reason behind a feature such as DDH is to help avoid cluster performance degradation due to an unhealthy drive. In the early days of vSAN, we came across multiple instances where drives were what might be termed “flaky”. This is where a drive has not completely failed, but exhibits erratic behavior, generating lots of IO retries and IO errors. How could we deal with such a situation? This is what led to the creation of DDH.
In the very first iteration of DDH, back in vSAN 6.1, vSAN monitored drives for excessive read and/or write latency. If it observed that the average latency of a drive was higher than 50ms over a 10-minute period, it triggered a dismount of the drive in question. Components on the dismounted drive were then marked as absent, and a rebuild would commence after 60 minutes had passed (defined by the repair delay timer). This led to a number of challenges, and a considerable number of support requests, mainly due to false positives from drives temporarily reporting higher average latencies. It was obvious that we needed a better method of detecting unhealthy drives.
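To make that original mechanism concrete, here is a minimal sketch in Python. This is purely illustrative and not actual vSAN code; the sampling interface and dismount call are my own hypothetical names.

```python
# Illustrative sketch only -- not actual vSAN code. The sampling method
# (get_avg_latency_ms) and the dismount call are hypothetical names.

WINDOW_MINUTES = 10
LATENCY_THRESHOLD_MS = 50  # vSAN 6.1 used a single 50ms threshold

def check_drive_vsan61(drive):
    """vSAN 6.1 behavior: one 10-minute window, one threshold."""
    avg_latency = drive.get_avg_latency_ms(window_minutes=WINDOW_MINUTES)
    if avg_latency > LATENCY_THRESHOLD_MS:
        # The drive is dismounted; its components are marked absent and a
        # rebuild starts once the 60-minute repair delay timer expires.
        drive.dismount()
```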
We made a number of enhancements in vSAN 6.2 and vSAN 6.5. Now the average latencies were tracked over multiple, randomly-selected 10-minute intervals. We also changed which devices DDH operated on. We no longer dismounted a cache or capacity device when we observed high read latencies, and DDH no longer dismounted a cache device due to high write latencies. Therefore, DDH would now only dismount capacity devices with high write latencies. The latency threshold for write IOs to a magnetic disk was set at 500 milliseconds (this would be a hybrid vSAN). The latency threshold for write IOs to an SSD was set at 200 milliseconds (this would be an all-flash vSAN).
To be sure that high write latency on a capacity drive is not a false positive, we also started to take multiple samples. A disk is now deemed unhealthy only when the average write IO round-trip latency exceeds the above thresholds four times within a six-hour period.
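The following sketch pulls the vSAN 6.2/6.5 detection behavior together. Again, this is a hedged illustration of the logic described above, not vSAN internals; all names are my own.

```python
# Illustrative sketch only -- not actual vSAN code. The thresholds and
# strike counting reflect the behavior described above; names are hypothetical.

THRESHOLD_MS = {"magnetic": 500, "ssd": 200}  # write IO latency thresholds
MAX_STRIKES = 4          # threshold must be exceeded four times...
OBSERVATION_HOURS = 6    # ...within a six-hour period

def is_capacity_drive_unhealthy(samples, media_type):
    """samples: average write latencies (ms) from randomly-selected
    10-minute intervals collected over the last six hours."""
    strikes = sum(1 for s in samples if s > THRESHOLD_MS[media_type])
    return strikes >= MAX_STRIKES
```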
Finally, a new remount mechanism was added. After DDH initiates an unmount, a remount is attempted on the failed or dismounted disk to address transient issues. This is attempted approximately 24 times over a 24-hour period. However, we observed an unwanted behavior in this version. It was found that a drive could be dismounted without migrating (rebuilding) the data on it to other storage elsewhere in the vSAN cluster. This is an issue when the only remaining copy of the data is on the problematic drive and DDH goes ahead and unmounts it. We now needed to ensure that DDH never tries to unmount a drive containing the last copy of the data.
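The remount retry loop could be pictured roughly as follows. The cadence shown (hourly attempts, 24 in total) is my own assumption based on the "approximately 24 times over a 24-hour period" behavior above; the remount call is hypothetical.

```python
# Illustrative sketch only. The roughly hourly cadence is an assumption
# inferred from "~24 attempts over a 24-hour period"; disk.remount() is
# a hypothetical API, not a vSAN call.
import time

def try_remount(disk, attempts=24, interval_seconds=3600):
    for _ in range(attempts):
        if disk.remount():          # succeeds if the issue was transient
            return True
        time.sleep(interval_seconds)
    return False                    # still unhealthy after ~24 hours
```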
In vSAN 6.6, this undesirable behavior was addressed. When a device is degraded in vSAN 6.6 (and later), its components are evaluated. If a component does not belong to the last replica, DDH can go ahead and mark it as absent. This results in a “lazy” (after 60 minutes) evacuation/rebuild, since another replica of the object still exists. However, if a component belongs to the last replica, vSAN now starts a data evacuation from the device immediately.
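The vSAN 6.6 decision can be summarized in a short sketch. As with the earlier examples, this is an illustration of the rule described above with hypothetical names, not actual vSAN code.

```python
# Illustrative sketch only -- hypothetical names. Captures the vSAN 6.6+
# rule: protect the last remaining replica before dismounting a device.

def handle_degraded_device(device):
    for component in device.components:
        if component.is_last_replica():
            # Last copy of the data: evacuate it off the device
            # immediately rather than risk data loss.
            component.evacuate_immediately()
        else:
            # Another replica exists elsewhere: mark the component absent
            # and let the "lazy" rebuild kick in after the 60-minute
            # repair delay timer.
            component.mark_absent()
```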
One other useful feature of DDH is that once it detects an unhealthy disk, it triggers an attempt to log key disk SMART attributes. S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) is a mechanism for monitoring and detecting errors on magnetic disks and Solid State Drives (SSDs). We log the following SMART attributes (also summarized in a small lookup table after the list), which will hopefully give you an idea as to why the device was behaving erratically in the first place, and why DDH chose to unmount it from vSAN.
- Re-allocated sector count. Attribute ID = 0x05.
- Uncorrectable errors count. Attribute ID = 0xBB.
- Command timeouts count. Attribute ID = 0xBC.
- Re-allocated sector event count. Attribute ID = 0xC4.
- Pending re-allocated sector count. Attribute ID = 0xC5.
- Uncorrectable sector count. Attribute ID = 0xC6.
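For convenience when reading DDH log output, the attribute IDs above can be expressed as a small lookup table. This is just a reference mapping of the list, not a vSAN API.

```python
# The SMART attribute IDs listed above, as a simple lookup table.

SMART_ATTRIBUTES = {
    0x05: "Re-allocated sector count",
    0xBB: "Uncorrectable errors count",
    0xBC: "Command timeouts count",
    0xC4: "Re-allocated sector event count",
    0xC5: "Pending re-allocated sector count",
    0xC6: "Uncorrectable sector count",
}

def describe(attr_id):
    return SMART_ATTRIBUTES.get(attr_id, "Unknown attribute")
```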
That completes the DDH revisited post. Like I said at the beginning, the whole idea is to mask out drives that may have intermittent performance issues from the rest of the vSAN cluster. If you do come across a DDH event, the log messages may indicate exactly why DDH unmounted the drive in question.
Interesting that S.M.A.R.T. data, if available, is not used in deciding if the D{isk|evice} is D{ead|egraded} or not. Would you mind sharing the reason behind “trying” to remount?
Thinking about it, I assume the main reason is to validate whether or not there was an actual issue with the device, or whether the device may have seen significant latency issues due to a runaway workload or some other issue in the system, e.g. lots of rebuild/resync activity. A lot has been done in 6.7 and 6.7U1 to mitigate these issues, but I guess there could still be some corner cases.
Let me ask about our plans to look at the SMART data. I seem to remember we were discussing this in the past, but I'm not sure what became of it.