VSAN Part 32 – Datastore capacity not adding up

Cormac

10 years ago

I was involved in an interesting case recently. The customer was running an 8 node cluster, with 4 disk groups per host and 5 x ~900GB hard disks per disk group, which should have provided somewhere in the region of 150TB of storage capacity (less a little overhead for metadata). But after some maintenance tasks, the customer was seeing only approximately 100TB on the VSAN datastore.
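As a quick sanity check on those numbers, the expected raw capacity can be worked out from the host, disk group and disk counts. This is only a back-of-the-envelope sketch: it assumes decimal units (1TB = 1000GB), treats every disk as exactly 900GB, and ignores metadata overhead. The function name is made up for illustration.

```python
# Raw capacity estimate for the cluster described above.
# Assumptions: decimal units (1 TB = 1000 GB), each disk exactly 900 GB,
# no metadata overhead. Only the counts and "~900GB" come from the post.
HOSTS = 8
DISK_GROUPS_PER_HOST = 4
DISKS_PER_GROUP = 5
DISK_SIZE_GB = 900


def raw_capacity_tb(hosts, groups_per_host, disks_per_group, disk_size_gb):
    """Return (capacity disk count, raw capacity in TB)."""
    disk_count = hosts * groups_per_host * disks_per_group
    return disk_count, disk_count * disk_size_gb / 1000


disk_count, capacity_tb = raw_capacity_tb(
    HOSTS, DISK_GROUPS_PER_HOST, DISKS_PER_GROUP, DISK_SIZE_GB
)
print(f"{disk_count} capacity disks, roughly {capacity_tb:.0f} TB raw")
```

With these figures the script reports 160 disks and about 144TB raw, so a datastore showing around 100TB is clearly missing a sizeable number of disks.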

This was a little strange since the VSAN status in the vSphere web client was showing all 160 disks claimed by VSAN, yet the capacity of the VSAN datastore did not reflect this. So what could cause this behaviour?

After some investigation, it became obvious that some of those disks were not contributing their storage in their respective disk groups. One of the main clues was that when the disks were queried with the esxcli vsan storage list command, they were shown as no longer being part of CMMDS, VSAN’s Cluster Monitoring, Membership and Directory Service:

naa.600605b008b04b90ff0000a60a119dd3:
  Device: naa.600605b008b04b90ff0000a60a119dd3
  Display Name: naa.600605b008b04b90ff0000a60a119dd3
  Is SSD: false
  VSAN UUID: 520954bd-c07c-423c-8e42-ff33ca5c0a81
  VSAN Disk Group UUID: 52564730-8bc6-e442-2ab9-6de5b0043d87
  VSAN Disk Group Name: naa.600605b008b04b90ff0000a80a26f73f
  Used by this host: true
  In CMMDS: false
  Checksum: 15088448381607538692
  Checksum OK: true

This explained why the capacity was showing up incorrectly, but it did not explain why VSAN was unable to use the capacity of these disks for the VSAN datastore.
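With 160 disks to check, it can help to save the output of esxcli vsan storage list from each host and scan it for disks reporting In CMMDS: false. The following is a minimal sketch that assumes the output layout shown above (an unindented device line followed by indented "Key: value" lines); the function names are hypothetical and the layout may differ between releases.

```python
# Scan saved 'esxcli vsan storage list' output for disks that are not in CMMDS.
# Assumption: the layout matches the sample in this post (unindented device
# line, then indented "Key: value" lines). Function names are hypothetical.

def parse_vsan_storage_list(text):
    """Return {device_name: {field: value}} parsed from the command output."""
    disks = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line such as "naa.xxx:" starts a new device record.
            current = line.strip().rstrip(":")
            disks[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            disks[current][key.strip()] = value.strip()
    return disks


def disks_not_in_cmmds(disks):
    """Return the device names whose 'In CMMDS' field is 'false'."""
    return [name for name, fields in disks.items()
            if fields.get("In CMMDS") == "false"]


STORAGE_SAMPLE = """\
naa.600605b008b04b90ff0000a60a119dd3:
  Device: naa.600605b008b04b90ff0000a60a119dd3
  Display Name: naa.600605b008b04b90ff0000a60a119dd3
  Is SSD: false
  VSAN UUID: 520954bd-c07c-423c-8e42-ff33ca5c0a81
  VSAN Disk Group UUID: 52564730-8bc6-e442-2ab9-6de5b0043d87
  VSAN Disk Group Name: naa.600605b008b04b90ff0000a80a26f73f
  Used by this host: true
  In CMMDS: false
  Checksum: 15088448381607538692
  Checksum OK: true
"""

missing = disks_not_in_cmmds(parse_vsan_storage_list(STORAGE_SAMPLE))
for name in missing:
    print(f"{name} is claimed by VSAN but not in CMMDS")
```

Run against the sample above, this flags the single disk shown. In practice the text would come from a file captured on each host rather than an inline string.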

After some additional research, we found that the underlying volumes on the disks were seen as “snapshots” by the ESXi host. This can be verified using the following command:

~ # esxcli storage vmfs snapshot list
   54228778-891c4b60-a013-000c29fe01fa
   Volume Name: test-demo
   VMFS UUID: 54228778-891c4b60-a013-000c29fe01fa
   Can mount: true
   Reason for un-mountability: 
   Can resignature: true
   Reason for non-resignaturability: 
   Unresolved Extent Count: 1
~ #

This behaviour can happen for a number of reasons, and is not specific to VSAN. In the past we have seen this issue (local VMFS volumes being reported as snapshots) when customers upgraded controller firmware or replaced storage controllers on the host. When the volume is seen as a snapshot, it will not be mounted by ESXi.

In this particular scenario, the disks were present and correct, but the volumes were not mounted, which meant that they could not be included in the capacity calculations.

Later on in this case it was discovered that the maintenance activity at the customer site involved a number of changes to the servers, including the replacement of a motherboard in one of the servers.

Once this root cause was confirmed, the volumes were mounted using the command esxcli storage vmfs snapshot mount -u, supplying the VMFS UUID reported by the snapshot list command. Mounting the volumes resolved the issue and brought the VSAN datastore back to full capacity.
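When many volumes are affected, the mount commands can be generated from saved esxcli storage vmfs snapshot list output rather than typed by hand. The sketch below only prints the commands for review and does not run anything. It assumes the output layout shown earlier in this post, with Volume Name as the first field of each entry, and the function names are hypothetical.

```python
# Build 'esxcli storage vmfs snapshot mount -u <uuid>' commands from saved
# 'esxcli storage vmfs snapshot list' output. Nothing is executed here.
# Assumption: each entry's first "Key: value" field is "Volume Name", as in
# the sample output in this post. Function names are hypothetical.

def parse_snapshot_list(text):
    """Return a list of {field: value} dicts, one per snapshot volume."""
    records = []
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # blank lines, shell prompts and the bare UUID header
        if key == "Volume Name":
            records.append({})
        if records:
            records[-1][key] = value.strip()
    return records


def mount_commands(text):
    """Return one mount command per volume that reports 'Can mount: true'."""
    return [
        f"esxcli storage vmfs snapshot mount -u {rec['VMFS UUID']}"
        for rec in parse_snapshot_list(text)
        if rec.get("Can mount") == "true" and "VMFS UUID" in rec
    ]


SNAPSHOT_SAMPLE = """\
   54228778-891c4b60-a013-000c29fe01fa
   Volume Name: test-demo
   VMFS UUID: 54228778-891c4b60-a013-000c29fe01fa
   Can mount: true
   Reason for un-mountability:
   Can resignature: true
   Reason for non-resignaturability:
   Unresolved Extent Count: 1
"""

for command in mount_commands(SNAPSHOT_SAMPLE):
    print(command)
```

Printing the commands instead of executing them leaves room to confirm the root cause first, as was done in this case, before anything is mounted.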

If you find discrepancies between the available physical capacity per host and the VSAN datastore capacity, check that every disk displays In CMMDS: true. If any disks show false, check whether the ESXi host is mounting the volumes correctly (and not seeing them as snapshots) using the above commands.

A KB article is being created to outline the correct steps to follow if this situation arises. In the meantime, GSS can be contacted for further assistance.