Component Metadata Health – Locating Problematic Disk

by Cormac · March 18, 2016 · March 16, 2016

I’ve noticed a couple of customers experiencing a Component Metadata Health failure on the VSAN health check recently. This is typically what it looks like:

The first thing to note is that the KB associated with this health check states the following:

Note: This health check test can fail intermittently if the destaging process is slow, most likely because VSAN needs to do physical block allocations on the storage devices. To work around this issue, run the health check once more after the period of high activity (multiple virtual machine deployments, etc) is complete. If the health check continues to fail the warning is valid. If the health check passes, the warning can be ignored.

With that in mind, let’s figure out which disk holds the potentially problematic component. The warning above reports a component UUID, but customers are having difficulty matching this UUID to a physical device. In other words, on which physical disk does this component reside? Currently, the only way to find out is through RVC, the Ruby vSphere Console. The following example shows how to locate the physical device on which a component of an object resides.

First, using vsan.cmmds_find, search on the component UUID as reported in the health check (components with errors) to get the disk UUID. Some of the preceding columns have been removed for readability, and the command is run against the cluster object (represented by 0):

> vsan.cmmds_find 0 -u dc3ae056-0c5d-1568-8299-a0369f56ddc0
---+---------+-----------------------------------------------------------+
   | Health  | Content                                                   |
---+---------+-----------------------------------------------------------+
   | Healthy | {"diskUuid"=>"52e5ec68-00f5-04d6-a776-f28238309453",      |
   |         |  "compositeUuid"=>"92559d56-1240-e692-08f3-a0369f56ddc0", |
   |         |  "capacityUsed"=>167772160,                               |
   |         |  "physCapacityUsed"=>167772160,                           | 
   |         |  "dedupUniquenessMetric"=>0,                              |
   |         |  "formatVersion"=>1}                                      |
---+---------+-----------------------------------------------------------+
/localhost/Cork-Datacenter/computers>
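
If you would rather not copy the diskUuid by hand out of a wide terminal, a small script can pull it out for you. The following is a minimal Python sketch and not part of RVC: it assumes the Content column has been pasted in as text and holds a single entry, and the helper name parse_cmmds_content is hypothetical. It reads the "key"=>value pairs and prints the follow-up lookup command.

```python
import re

# Matches one "key"=>value pair as printed by RVC (Ruby hash syntax).
# Values are either double-quoted strings or bare integers.
PAIR = re.compile(r'"(\w+)"=>(?:"([^"]*)"|(-?\d+))')


def parse_cmmds_content(table_text):
    """Collect the key/value pairs found in pasted vsan.cmmds_find output.

    Assumes the text holds a single entry; with several entries,
    later keys would overwrite earlier ones.
    """
    entry = {}
    for key, text_value, int_value in PAIR.findall(table_text):
        entry[key] = text_value if int_value == "" else int(int_value)
    return entry


# Content column copied from the component lookup shown above.
sample = '''
   | Healthy | {"diskUuid"=>"52e5ec68-00f5-04d6-a776-f28238309453",      |
   |         |  "compositeUuid"=>"92559d56-1240-e692-08f3-a0369f56ddc0", |
   |         |  "capacityUsed"=>167772160,                               |
   |         |  "physCapacityUsed"=>167772160,                           |
   |         |  "dedupUniquenessMetric"=>0,                              |
   |         |  "formatVersion"=>1}                                      |
'''

component = parse_cmmds_content(sample)
disk_uuid = component["diskUuid"]

# Build the DISK lookup, run against the cluster object 0 as in this post.
print(f"vsan.cmmds_find 0 -t DISK -u {disk_uuid}")
```

The regular expression only understands quoted strings and integers, which covers every field in the output shown here; other value types would need extra handling.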

Now that you have the diskUuid, you can use it in the next command. Once more, some of the preceding columns in the output have been removed for readability:

> vsan.cmmds_find 0 -t DISK -u 52e5ec68-00f5-04d6-a776-f28238309453
---+---------+-------------------------------------------------------+
   | Health  | Content                                               |
---+---------+-------------------------------------------------------+
   | Healthy | {"capacity"=>145303273472,                            |
   |         |  "iops"=>100,                                         |
   |         |  "iopsWritePenalty"=>10000000,                        |
   |         |  "throughput"=>200000000,                             |
   |         |  "throughputWritePenalty"=>0,                         |
   |         |  "latency"=>3400000,                                  |
   |         |  "latencyDeviation"=>0,                               |
   |         |  "reliabilityBase"=>10,                               |
   |         |  "reliabilityExponent"=>15,                           |
   |         |  "mtbf"=>1600000,                                     |
   |         |  "l2CacheCapacity"=>0,                                |  
   |         |  "l1CacheCapacity"=>16777216,                         |
   |         |  "isSsd"=>0,                                          |   
   |         |  "ssdUuid"=>"52bbb266-3a4e-f93a-9a2c-9a91c066a31e",   |
   |         |  "volumeName"=>"NA",                                  |
   |         |  "formatVersion"=>"3",                                |
   |         |  "devName"=>"naa.600508b1001c5c0b1ac1fac2ff96c2b2:2", | 
   |         |  "ssdCapacity"=>0,                                    |
   |         |  "rdtMuxGroup"=>80011761497760,                       |
   |         |  "isAllFlash"=>0,                                     |
   |         |  "maxComponents"=>47661,                              |
   |         |  "logicalCapacity"=>0,                                |
   |         |  "physDiskCapacity"=>0,                               |
   |         |  "dedupScope"=>0}                                     |
---+---------+-------------------------------------------------------+
>

In the devName field above, you now have the NAA id (the SCSI id) of the disk.
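
Note that the value after the colon in devName is not part of the NAA id. Here is a minimal Python sketch of separating the two. It assumes the suffix is the partition number on the device (the usual ESXi device:partition notation) and that isSsd=0 marks a magnetic disk; the helper name split_dev_name is hypothetical.

```python
def split_dev_name(dev_name):
    """Split 'naa.<id>:<n>' into the device id and the suffix after the colon.

    Returns (device_id, suffix); suffix is None when there is no colon.
    """
    device_id, sep, suffix = dev_name.partition(":")
    return device_id, (suffix if sep else None)


# Values copied from the DISK entry shown above.
disk_entry = {
    "devName": "naa.600508b1001c5c0b1ac1fac2ff96c2b2:2",
    "isSsd": 0,
}

naa_id, partition = split_dev_name(disk_entry["devName"])

# Assumption: isSsd=0 means a magnetic (non-flash) device.
tier = "flash" if disk_entry["isSsd"] else "magnetic"

print(f"NAA id:    {naa_id}")
print(f"Partition: {partition}")
print(f"Tier:      {tier}")
```

The bare NAA id is what you would then look for in each host's device list, for example in the output of esxcli storage core device list.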

I’ve requested that this information get added to the KB article.


22 Replies to “Component Metadata Health – Locating Problematic Disk”

Chip says:

March 18, 2016 at 2:07 pm

Wouldn’t it be even MORE helpful if the vsan.cmmds_find command added a field which returned the host on which the component was located? For customers that have large VSAN clusters with multiple disk groups containing several disks each, a certain amount of search would still be required to match the NAA disk ID to the appropriate host. Having any RVC commands which find and return information on components or objects that also state the containing host would be much simpler and more valuable, I think.
1. Cormac says:
  
  March 18, 2016 at 2:34 pm
  
  Having it in the UI would be even easier, which is the feature request I’m filing. But good point on the RVC command too. I’ll add that.
Boom says:

March 30, 2016 at 5:23 pm

We’re experiencing this problem. Unfortunately, using vsan.cmmds_find to locate the UUID yields no results, yet the Health plug-in still shows the failed components. I’ve had an SR open with VMware for a couple of weeks. No progress.
1. HamR says:
  
  April 5, 2016 at 12:24 am
  
  Any update on this one Boom? I have the same issue.
  1. Cormac says:
    
    April 5, 2016 at 1:51 pm
    
    Please open SRs for these issues with VMware support. They have tools where they can validate whether this is an actual issue or not, or if it is cosmetic, once they have the logs.
    1. HamR says:
      
      April 6, 2016 at 6:29 am
      
      In my case it was deemed to be cosmetic. The object was invalid because it didn’t exist.
      1. Boom says:
        
        April 6, 2016 at 4:16 pm
        
        How were you able to remove them?
  2. Boom says:
    
    April 5, 2016 at 6:58 pm
    
    VMware states that in our case the component objects are stale. We’re still working with them to get them deleted.
Boom says:

April 11, 2016 at 4:03 pm

We upgraded our VSAN to 6.2 over the weekend. It looks like during the upgrade the invalid objects were removed.
Marc says:

May 26, 2016 at 2:34 pm

I’m having an issue after a rebalance on a vSAN 6.2 cluster. I have one disk on a host failing the metadata health check. The notes field in the UI shows “Ops backlog (Prepares=0, Commits=11210)”. The disk shows no faults and appears healthy according to the server’s iDRAC/PERC information, as well as when running the commands listed above. Any thoughts?
1. Cormac says:
  
  May 30, 2016 at 9:29 am
  
  I think the rebalance activity is “overloading” the disk. Wait for the rebalance activity to complete, rerun the health check and see if the issue persists. I suspect that it will go away once rebalance is completed.
  1. Marc says:
    
    May 31, 2016 at 1:04 pm
    
    Unfortunately, waiting for the rebalance and then retesting still shows the error. I’ve actually run multiple rebalances and retested over the week with the same result. I opened an SR with VMware, and they too reported that all the disks were healthy and that the alert should be ignored. Hopefully this will be patched.
    
    I have noticed that the “Ops backlog (Prepares=0, Commits=11212)” value has gone up over time.
Anny says:

May 27, 2016 at 3:00 am

Stupid. 2016 and you have to do all this just to identify a disk? C’mon!
1. Cormac says:
  
  May 27, 2016 at 8:13 am
  
  Already provided this feedback – but thanks for your thoughts Anny. I think we’re in agreement.
Wouter says:

June 7, 2016 at 9:15 am

When I perform storage migration or clone actions on vSAN, I receive the following message: Physical disk: Component metadata health: Failed.
Notes: Ops backlog.
These messages seem to appear randomly whenever vSAN is under write load. Reads from vSAN are fast, but writes are so slow that migration and clone actions are failing.

Current setup:
Force 10: S4810p

3x Nodes:
2x 10Gbit Intel NIC (1 in use for vSAN)
4x Flash (1x cache and 3x capacity)
1x Raid Controller Perc H730 in HBA mode (firmware: 25.4.0017)
ESXi driver: lsi_mr3 version 6.903.85.00-1 (async driver)
ESXi changed settings:
esxcfg-advcfg -s 100000 /LSOM/diskIoTimeout
esxcfg-advcfg -s 4 /LSOM/diskIoRetryFactor

vSAN network MTU: 9000
Hardware compatibility: passed

I just opened an SR at VMware.
1. Cormac says:
  
  June 7, 2016 at 9:27 am
  
  Please update us with the results of the investigation.
  1. David says:
    
    June 14, 2016 at 2:20 pm
    
    Pretty similar setup and same issue here. Updates are very appreciated. 😉
    1. David says:
      
      June 15, 2016 at 9:57 am
      
      VMware support says:
      
      ‘This message “Ops backlog (Prepares = 0, Commits = 11871)” means that 11871 operations are held on the SSD and should be moved to the capacity drives at some point.
      In VSAN Health the threshold is set to 10000, which is why you got this error.
      This error can indicate that the capacity layer is unable to cope with the load, but usually, as in this case, it happens in new installations with a large SSD cache and just a few active VMs.
      So please deploy more active VMs and let VSAN work for real ;)’
      
      We are trying that at the moment.
      1. David says:
        
        June 15, 2016 at 2:04 pm
        
        It worked. Took some time but then the metadata health went to green.
Boom says:

June 15, 2016 at 6:55 pm

What did you change to fix the metadata health alerts?
