VSAN 6.2 Upgrade – Failed to realign objects

Cormac

8 years ago

A number of customers have reported experiencing difficulty when attempting to upgrade the on-disk format on VSAN 6.2. The upgrade to vSphere 6.0u2 goes absolutely fine; it is only when they try to upgrade the on-disk format, to use new features such as Software Checksum, and Deduplication and Compression, that they encounter this error. Here is a sample screenshot of the sort of error that is thrown by VSAN:

One thing I do wish to call out: administrators must use the VSAN UI to upgrade the on-disk format. Do not simply evacuate a disk group, remove it and recreate it. This will cause a mismatch between the existing disk group versions (V2) and the new disk groups that you just created (V3). Use the UI for the on-disk format upgrade.

What is the upgrade doing?

The first thing to note is that there are alignment steps involved in upgrading to the new on-disk format. These prepare the objects for the new VSAN features. The first step is the realignment of all objects to a 1MB address space. The next step is to align all vsanSparse objects to a 4KB boundary. This brings all objects to version 2.5 (an interim version) and readies them for the on-disk format upgrade to V3. This is why you cannot simply remove and recreate the disk groups as mentioned earlier. You cannot have aligned and misaligned objects co-existing.
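To make the idea of "aligning to a boundary" concrete, here is a minimal arithmetic sketch. It is purely illustrative and is not VSAN code: the function names are hypothetical, and I am assuming that 1MB and 4KB mean 2^20 and 2^12 bytes respectively. It only shows what it means for an offset to sit on a boundary, and how an offset would be rounded up to the next one.

```python
# Illustrative only: boundary sizes assumed to be binary units.
ONE_MB = 1024 * 1024   # assumed size of the 1MB address space unit
FOUR_KB = 4 * 1024     # assumed size of the 4KB boundary


def is_aligned(offset, boundary):
    """Return True if offset falls exactly on a multiple of boundary."""
    return offset % boundary == 0


def align_up(offset, boundary):
    """Round offset up to the next multiple of boundary (no-op if aligned)."""
    return ((offset + boundary - 1) // boundary) * boundary


if __name__ == "__main__":
    for offset in (0, 4096, 4097, ONE_MB - 1, 3 * ONE_MB):
        print(offset,
              "4KB-aligned" if is_aligned(offset, FOUR_KB) else "not 4KB-aligned",
              "-> next 1MB boundary:", align_up(offset, ONE_MB))
```

An offset that is already on the boundary is left where it is; anything else moves up to the next multiple.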

What is the problem – why can VSAN not align the objects?

There are three issues that you might encounter.

The first is a stranded object on the VSAN datastore that no longer has an association. This is commonly due to (a) a VM swap object that got left behind because the VM was removed while the cluster was in the middle of a maintenance or failure event, or (b) administrators using “rm” on the VSAN datastore to remove the descriptor file or home namespace of a VM. This “rm” command does not remove the actual VMDK object, only the association, so the stranded object is still sitting on the VSAN datastore. Another variation of (b) is (c), where you might have removed hosts from a previous cluster and created a new cluster with those hosts. If not done correctly, this might have led to a UUID mismatch and left the VMs, including their VMDKs, inaccessible. If you did not care about these VMs (e.g. you might have been conducting a POC or just “kicking the tires” of VSAN), you may have simply removed them from the inventory. This would have left the VMDKs stranded once more.

Note that the stranded VM swap object is not a new issue. We had this when upgrading from VSAN 5.5 to 6.0, and there is already an RVC command (vsan.purge_inaccessible_vswp_objects) to help you clear these. More details can be found in the upgrade section of the VSAN 6.0 Troubleshooting Reference Manual.

The second issue is a locked change block tracking (CBT) file left in the VM’s home namespace. CBT uses ctk files to track the changes made to the VM since the last backup or replication. Now if you have a VM with a ctk file, and the VM is vMotion’ed to another host in the cluster, a lock is left on the file. This lock is what prevents the on-disk format upgrade.

The third issue is a broken snapshot disk chain. Again, this is similar to the missing disk descriptor file making objects inaccessible, but in this case it is one of the delta VMDK descriptors in the chain that has been inadvertently deleted, breaking the chain and leaving the delta object stranded.

What is the symptom?

There will be a General VSAN Error, as shown above, with the following description:

Failed to realign following Virtual SAN objects: 1e58f256-44f6-0201-3f17-02001823f4e0, f44ff256-11e1-1c0a-a06a-020018428a28, 8959f256-7ab7-f01b-913e-02001823f4e0, db50f256-29a3-2f2f-2404-02001823f4e0, 2358f256-9d3c-8b66-af4b-02001823f4e0, 1c58f256-bf10-b769-fb61-020018428a28, 7559f256-4da3-e787-ecbb-02001823f4e0, e850f256-10e0-eb8d-cac9-020018c78c91, 5858f256-4251-d5aa-d008-02001823f4e0, 0a51f256-f9de-3db4-c75b-020018428a28, 1e58f256-34fc-1cb7-f8aa-020018428a28, dd50f256-fb57-80c0-8894-020018d2303b, ec50f256-9792-1ddb-d757-020018428a28, 534ff256-4b1c-05e5-1092-020018428a28, de57f256-0c14-8bf0-635b-020018c78c91, 1658f256-3ae8-fcf7-a9a3-020018d2303b, due to being locked or lack of vmdk descriptor file, which requires manual fix.
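Since the error can list a lot of objects, it can be handy to pull the UUIDs out into a clean list before you start working through them. The following is a minimal sketch, assuming the error text has the form shown above. The function name is my own invention for illustration and is not part of any VMware tool or of the script mentioned later in this post.

```python
import re

# Matches an object UUID in the 8-4-4-4-12 hexadecimal form seen in the error text.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)


def extract_object_uuids(error_text):
    """Return the object UUIDs found in error_text, in order, without duplicates."""
    seen = set()
    uuids = []
    for match in UUID_RE.findall(error_text):
        uuid = match.lower()
        if uuid not in seen:
            seen.add(uuid)
            uuids.append(uuid)
    return uuids


# Shortened copy of the error message, using the first two UUIDs from above.
sample = (
    "Failed to realign following Virtual SAN objects: "
    "1e58f256-44f6-0201-3f17-02001823f4e0, f44ff256-11e1-1c0a-a06a-020018428a28, "
    "due to being locked or lack of vmdk descriptor file, which requires manual fix."
)

if __name__ == "__main__":
    for uuid in extract_object_uuids(sample):
        print(uuid)
```

Paste the full error text in place of the shortened sample and you get one UUID per line, ready to check off as each object is dealt with.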

How can it be fixed?

VMware will be providing a permanent fix in due course. In the meantime we are providing a script that will detect stranded objects, broken disk chains and objects with a CBT lock. Where possible, the script will then ask whether you want the issue to be fixed for you, align the objects, and bring everything to interim version 2.5. This will then allow the upgrade of the on-disk format to proceed to V3. All steps (as well as the script) can be found in the KB articles listed below.

For details on the stranded objects, please refer to KB article 2144881. In the case of VM swap objects, these will be discarded as they are of no use and contain no useful data. In the case of the stranded VMDKs, the descriptors will be recreated and placed in a lost+found folder in the VSAN datastore. Administrators can then create a temporary VM, attach the VMDKs, examine the data and decide whether they still need these VMDKs or not.

For details on the CBT issue, please refer to KB article 2144882. For this, we use an alternate “internal” method of accessing the file with the lock, allowing it to be aligned.

Broken disk chains are a little more complicated, and cannot be fixed without getting a human involved. KB article 1004232 will guide you through the steps of rebuilding the chain.

Although this is a bit of a drag, on the plus side you will now have a lot of reclaimed stranded space after the procedure.

What is this issue not related to?

A number of customers have attributed this issue to App Volumes, vCloud Director, Linked Clones, snapshots, etc. In our testing, these cases have all pointed back to a broken chain, a CBT lock or an un-associated object. We now have a mechanism to detect these objects, and steps on how to clean them up.

What if I still have issues?

If the on-disk format upgrade fails, and you cannot address it with the procedures described in the KB articles provided here, please contact GSS, our support folks.