Component Metadata Health – Locating Problematic Disk

I’ve noticed a couple of customers experiencing a Component Metadata Health failure on the VSAN health check recently. This is typically what it looks like:

[screenshot: component-metadata-health]

The first thing to note is that the KB associated with this health check states the following:

Note: This health check test can fail intermittently if the destaging process is slow, most likely because VSAN needs to do physical block allocations on the storage devices. To work around this issue, run the health check once more after the period of high activity (multiple virtual machine deployments, etc.) is complete. If the health check continues to fail, the warning is valid. If the health check passes, the warning can be ignored.

With that in mind, let’s continue to figure out which disk holds the potentially problematic component. The warning above reports a component UUID, but customers are having difficulty matching this UUID to a physical device. In other words, on which physical disk does this component reside? Currently, the only way to locate it is through the Ruby vSphere Console (RVC). The following is an example of how you can locate the physical device on which a component of an object resides.
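If you are not already familiar with RVC, it ships with vCenter Server and is launched from the vCenter shell. Here is a minimal sketch of getting to the right place in the inventory, assuming a vCenter at vcsa.example.com and a cluster named VSAN-Cluster (both placeholders; the datacenter name matches the prompt shown later in this post):

rvc administrator@vsphere.local@vcsa.example.com
> cd /localhost/Cork-Datacenter/computers
/localhost/Cork-Datacenter/computers> ls
0 VSAN-Cluster (cluster): ...

The numbers printed by ls act as shorthand for inventory objects, which is why the cluster can be referenced simply as 0 in the commands that follow.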

First, using vsan.cmmds_find, search on the component UUID reported in the health check (components with errors) to get the disk UUID. Some of the leading columns of the output have been removed for readability, and the command is run against the cluster object (represented by 0):

> vsan.cmmds_find 0 -u dc3ae056-0c5d-1568-8299-a0369f56ddc0
---+---------+-----------------------------------------------------------+
   | Health  | Content                                                   |
---+---------+-----------------------------------------------------------+
   | Healthy | {"diskUuid"=>"52e5ec68-00f5-04d6-a776-f28238309453",      |
   |         |  "compositeUuid"=>"92559d56-1240-e692-08f3-a0369f56ddc0", 
   |         |  "capacityUsed"=>167772160,                               |
   |         |  "physCapacityUsed"=>167772160,                           | 
   |         |  "dedupUniquenessMetric"=>0,                              |
   |         |  "formatVersion"=>1}                                      |
---+---------+-----------------------------------------------------------+
/localhost/Cork-Datacenter/computers>
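As an aside, the compositeUuid in the output above identifies the object to which the component belongs. If you also want to inspect that object and its full component layout, vsan.object_info should accept it directly (assuming the object still exists):

> vsan.object_info 0 92559d56-1240-e692-08f3-a0369f56ddc0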

Now that you have the diskUuid, you can use it in the next command. Once more, some of the leading columns of the output have been removed for readability:

> vsan.cmmds_find 0 -t DISK -u 52e5ec68-00f5-04d6-a776-f28238309453
---+---------+-------------------------------------------------------+
   | Health  | Content                                               |
---+---------+-------------------------------------------------------+
   | Healthy | {"capacity"=>145303273472,                            |
   |         |  "iops"=>100,                                         |
   |         |  "iopsWritePenalty"=>10000000,                        |
   |         |  "throughput"=>200000000,                             |
   |         |  "throughputWritePenalty"=>0,                         |
   |         |  "latency"=>3400000,                                  |
   |         |  "latencyDeviation"=>0,                               |
   |         |  "reliabilityBase"=>10,                               |
   |         |  "reliabilityExponent"=>15,                           |
   |         |  "mtbf"=>1600000,                                     |
   |         |  "l2CacheCapacity"=>0,                                |  
   |         |  "l1CacheCapacity"=>16777216,                         |
   |         |  "isSsd"=>0,                                          |   
   |         |  "ssdUuid"=>"52bbb266-3a4e-f93a-9a2c-9a91c066a31e",   |
   |         |  "volumeName"=>"NA",                                  |
   |         |  "formatVersion"=>"3",                                |
   |         |  "devName"=>"naa.600508b1001c5c0b1ac1fac2ff96c2b2:2", | 
   |         |  "ssdCapacity"=>0,                                    |
   |         |  "rdtMuxGroup"=>80011761497760,                       |
   |         |  "isAllFlash"=>0,                                     |
   |         |  "maxComponents"=>47661,                              |
   |         |  "logicalCapacity"=>0,                                |
   |         |  "physDiskCapacity"=>0,                               |
   |         |  "dedupScope"=>0}                                     |
---+---------+-------------------------------------------------------+
>

In the devName field above, you now have the NAA ID (the SCSI identifier) of the disk. Note that the trailing “:2” is a partition number rather than part of the device name.
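To figure out which host owns that disk, one option is to run vsan.disks_stats against the cluster, which lists every disk in the cluster grouped by host, and match the NAA ID there:

> vsan.disks_stats 0

Alternatively, you can check for the device on each ESXi host with esxcli, using the NAA ID without the partition suffix:

esxcli storage core device list -d naa.600508b1001c5c0b1ac1fac2ff96c2b2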

I’ve requested that this information get added to the KB article.

22 Replies to “Component Metadata Health – Locating Problematic Disk”

  1. Wouldn’t it be even MORE helpful if the vsan.cmmds_find command added a field that returned the host on which the component was located? For customers with large VSAN clusters, where multiple disk groups contain several disks each, a certain amount of searching would still be required to match the NAA disk ID to the appropriate host. Having RVC commands that find and return information on components or objects also state the containing host would be much simpler and more valuable, I think.

    1. Having it in the UI would be even easier, which is the feature request I’m filing. But good point on the RVC command too. I’ll add that.

  2. We’re experiencing this problem. Unfortunately, using vsan.cmmds_find to locate the UUID yields no results, yet the Health plug-in still shows the failed components. I’ve had an SR open with VMware for a couple of weeks. No progress.

      1. Please open SRs for these issues with VMware support. Once they have the logs, they have tools that can validate whether this is an actual issue or merely cosmetic.

        1. In my case it was deemed to be cosmetic. The object was invalid because it didn’t exist.

      2. VMware states that in our case the component objects are stale. We’re still working with them to get them deleted.

  3. We upgraded our VSAN to 6.2 over the weekend. It looks like during the upgrade the invalid objects were removed.

  4. I’m having an issue after a rebalance on a vSAN 6.2 cluster. I have one disk on a host failing the metadata health. The notes field in the UI shows “Ops backlog (Prepares=0, Commits=11210)”. The disk shows no faults and appears healthy from the server’s iDRAC/PERC information, as well as when running the commands listed above. Any thoughts?

    1. I think the rebalance activity is “overloading” the disk. Wait for the rebalance activity to complete, rerun the health check and see if the issue persists. I suspect that it will go away once rebalance is completed.

      1. Unfortunately, waiting for the rebalance and then retesting still shows the error. I’ve actually run multiple rebalances and retested over the week with the same result. I started an SR with VMware, and they too reported that all the disks were healthy and that the alert should be ignored. Hopefully this will be patched.

        I have noticed that the “Ops backlog (Prepares=0, Commits=11212)” value has gone up over time.

  5. When I perform storage migration or clone actions on vSAN, I receive the following messages: Physical disk: Component metadata health: Failed.
    Notes: Ops backlog.
    It looks like these messages appear randomly when vSAN is under write load. Reads from vSAN are fast, but writes are so slow that migration and clone actions are failing.

    Current setup:
    Force 10: S4810p

    3x Nodes:
    2x 10Gbit Intel NIC (1 in use for vSAN)
    4x Flash (1x cache and 3x capacity)
    1x RAID controller PERC H730 in HBA mode (firmware: 25.4.0017)
    ESXi driver: lsi_mr3 version 6.903.85.00-1 (async driver)
    ESXi changed settings:
    esxcfg-advcfg -s 100000 /LSOM/diskIoTimeout
    esxcfg-advcfg -s 4 /LSOM/diskIoRetryFactor

    vSAN network MTU: 9000
    Hardware compatibility: passed

    I just opened an SR with VMware.

        1. VMware support says:

          ‘This message “Ops backlog (Prepares = 0, Commits = 11871)” means that 11871 operations are held in the SSD and should be moved to the capacity drives at some point. In VSAN Health, the threshold is set to 10000, which is why you got this error. The error can point to an inability of the capacity layer to cope with the load, but usually, as in this case, it happens in new installations with a big SSD cache and only a few active VMs. So please deploy more active VMs and let VSAN work for real ;)’

          We are trying that at the moment.
