VSAN 6.2 Upgrade – Failed to realign objects

A number of customers have reported experiencing difficulty when attempting to upgrade the on-disk format on VSAN 6.2. The upgrade to vSphere 6.0u2 goes absolutely fine; it is only when they try to upgrade the on-disk format, to use new features such as Software Checksum, and Deduplication and Compression, that they encounter this error. Here is a sample screenshot of the sort of error that is thrown by VSAN:

[Screenshot: General VSAN error, "... due to being locked or lack of vmdk descriptor file"]

One thing I do wish to call out – administrators must use the VSAN UI to upgrade the on-disk format. Do not simply evacuate a disk group, remove it and recreate it. This will cause a mismatch between the previous disk group versions (V2) and the new disk group versions that you just created (V3). Use the UI for the on-disk format upgrade.

What is the upgrade doing?

The first thing to note is that there are alignment steps involved in upgrading to the new on-disk format. These prepare the objects for the new VSAN features. The first step is the realignment of all objects to a 1MB address space. The next step is to align all vsanSparse objects to a 4KB boundary. This brings all objects to version 2.5 (an interim version) and readies them for the on-disk format upgrade to V3. This is why you cannot simply remove and recreate the disk groups as mentioned earlier. You cannot have aligned and misaligned objects co-existing.
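To make the idea of "aligned to a boundary" concrete, here is a small Python sketch of the arithmetic involved. It is purely illustrative and is not how VSAN implements the realignment; the function names and constants are my own, and the only facts taken from the text above are the two granularities (1MB and 4KB).

```python
MIB = 1024 * 1024  # the 1MB address-space granularity mentioned above
KIB4 = 4 * 1024    # the 4KB boundary mentioned above


def is_aligned(value: int, boundary: int) -> bool:
    """Return True when value is an exact multiple of boundary."""
    return value % boundary == 0


def align_up(value: int, boundary: int) -> int:
    """Round value up to the next multiple of boundary."""
    return ((value + boundary - 1) // boundary) * boundary
```

For example, a hypothetical size of 1MB plus 512 bytes is not aligned to 1MB, and rounding it up with align_up gives 2MB.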

What is the problem – why can VSAN not align the objects?

There are three issues that you might encounter.

The first is a stranded object on the VSAN datastore that no longer has an association. This could commonly be due to (a) a VM swap object that got left behind because the VM was removed while the cluster was in the middle of a maintenance or failure event, or (b) administrators using “rm” on the VSAN datastore to remove the descriptor file or home namespace of a VM. This “rm” command does not remove the actual VMDK object, only the association, so the stranded object is still sitting on the VSAN datastore. A variation of (b) is (c), where you might have removed hosts from a previous cluster and created a new cluster with those hosts. If not done correctly, this might have led to a UUID mismatch and left the VMs, including their VMDKs, inaccessible. If you did not care about these VMs (e.g. you might have been conducting a POC or just “kicking the tires” of VSAN), you may have simply removed them from the inventory. This would have left the VMDKs stranded once more.

Note that the stranded VM swap object is not a new issue. We had this when upgrading from VSAN 5.5 to 6.0, and there is already an RVC command (vsan.purge_inaccessible_vswp_objects) to help you clear these. More details can be found in the upgrade section of the VSAN 6.0 Troubleshooting Reference Manual.

The second issue is a locked change block tracking (CBT) file left in the VM’s home namespace. CBT uses ctk files to track the changes made to the VM since the last backup or replication. If a VM with a ctk file is vMotion’ed to another host in the cluster, a lock is left on the file. This lock is what prevents the on-disk format upgrade.

The third issue is a broken snapshot disk chain. Again, this is similar to the missing disk descriptor file making objects inaccessible, but in this case, it is one of the delta VMDK descriptors in the chain that is inadvertently deleted, breaking the chain and leaving the delta object stranded.

What is the symptom?

There will be a General VSAN Error, as shown above, with the following description:

Failed to realign following Virtual SAN objects: 1e58f256-44f6-0201-3f17-02001823f4e0, f44ff256-11e1-1c0a-a06a-020018428a28, 8959f256-7ab7-f01b-913e-02001823f4e0, db50f256-29a3-2f2f-2404-02001823f4e0, 2358f256-9d3c-8b66-af4b-02001823f4e0, 1c58f256-bf10-b769-fb61-020018428a28, 7559f256-4da3-e787-ecbb-02001823f4e0, e850f256-10e0-eb8d-cac9-020018c78c91, 5858f256-4251-d5aa-d008-02001823f4e0, 0a51f256-f9de-3db4-c75b-020018428a28, 1e58f256-34fc-1cb7-f8aa-020018428a28, dd50f256-fb57-80c0-8894-020018d2303b, ec50f256-9792-1ddb-d757-020018428a28, 534ff256-4b1c-05e5-1092-020018428a28, de57f256-0c14-8bf0-635b-020018c78c91, 1658f256-3ae8-fcf7-a9a3-020018d2303b, due to being locked or lack of vmdk descriptor file, which requires manual fix.
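The object UUIDs in this message are the ones you will want to cross-check against the script output. As a convenience, a few lines of Python can pull them out of the pasted error text. This is only an illustrative sketch: the function name is hypothetical, and it assumes the UUIDs keep the 8-4-4-4-12 hexadecimal layout seen in the message above.

```python
import re
from typing import List

# Assumption: object UUIDs appear as 8-4-4-4-12 hex groups, as in the error above.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"
)


def extract_object_uuids(message: str) -> List[str]:
    """Return the unique object UUIDs found in message, in order of appearance."""
    unique = {}
    for uuid in UUID_RE.findall(message.lower()):
        unique.setdefault(uuid, None)  # dict keys keep first-seen order
    return list(unique)
```

Passing the full error description to extract_object_uuids returns one entry per object, which is easier to work through than the comma-separated run in the UI.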

How can it be fixed?

VMware will be providing a permanent fix in due course. In the meantime we are providing a script that will detect stranded objects, broken disk chains and objects with a CBT lock. Where possible, the script will then ask whether you want the issue to be fixed for you, align the objects, and bring everything to interim version 2.5. This will then allow the upgrade of the on-disk format to proceed to V3. All steps (as well as the script) can be found in the KB articles listed below.

For details on the stranded objects, please refer to KB article 2144881. In the case of VM swap objects, these will be discarded as they are of no use and contain no useful data. In the case of the stranded VMDKs, the descriptors will be recreated and placed in a lost+found folder in the VSAN datastore. Administrators can then create a temporary VM, attach the VMDKs, examine the data and decide if they still need these VMDKs or not.

For details on the CBT issue, please refer to KB article 2144882. For this, we use an alternate “internal” method of accessing the file with the lock, allowing it to be aligned.

Broken disk chains are a little more complicated, and cannot be fixed without getting a human involved. KB article 1004232 will guide you through the steps of rebuilding the chain.

Although this is a bit of a drag, on the plus side you will now have a lot of reclaimed stranded space after the procedure.

What is this issue not related to?

A number of customers have attributed this issue to App Volumes, vCloud Director, Linked Clones, snapshots, etc. In our testing, these have all pointed back to a broken chain, a CBT lock or an un-associated object. We now have a mechanism to detect these objects, and steps on how to clean them up.

What if I still have issues?

If the on-disk format upgrade fails, and you cannot address it with the procedures described in the KB articles provided here, please contact GSS, our support folks.

32 comments
    • No – there is only a command to monitor it afaik – vsan.upgrade_status.

      [Update] Sorry, I stand corrected – there is an RVC command to upgrade: vsan.ondisk_upgrade. If you do not have enough resources for evacuations, you can use this command with the “--allow-reduced-redundancy” option.

  1. One thing that came to mind was pointing out whether VMs are affected by this in regard to availability. In other words, if the upgrade fails does it impact running VMs?

    • Nope – the step in the upgrade that has issues is simply a realignment of components making up the VM objects. VMs will stay running at all times, and there is no impact if the upgrade step cannot align the components. This simply means that the upgrade process cannot move onto the next step which is to change the on-disk format to V3. In fact, there is no evacuation or migration of components in this step either. That only happens with the on-disk format step, which is next.

  2. Is there something we can do to ‘predict’ if we are going to have an issue? Or is the recommendation to WAIT to upgrade to on-disk v3 until an esxi patch is issued?

    • The decision is yours, but you are certainly fully supported with on-disk format V2 with vSphere 6.0U2. There will be annoying health check warnings about the on-disk format not being V3.

      However, the patch will basically do what the script is doing. I do not have a date for it, though. Staying on V2 also means that if you want to use new features like dedupe/compression/checksum, you will not be able to until the on-disk format is at V3.

      I think the long term goal is to get this into the health check, so that you can be automatically told about stranded objects, and prompted to clean them up.

  3. Can the script be used as a pre-upgrade check? Or is that something you can use only after upgrading?

    • Good question. I would need to try out whether it will report when the on-disk format is at V2, or if it needs the upgrade attempted which brings the on-disk format to the interim version of 2.5 before it reports on anything useful. Note that even if you are on on-disk format V2.5, this will not cause any issues for the running VMs.

  4. Hi Cormac, we have a cluster of 6.0u1 VC + ESX and joined a single 6.0u2 host to it – but when creating disk groups on the new host it automatically formats to the new v3+ format which we don’t want (yet) – do you know if there is a way to force it to create Virsto/v2 vSAN disk format using an adv host setting perhaps?

    Alternative is I guess we’d need to rebuild the host without the u2 patches 🙂

  5. Well, updating the ondisk-format fails with “Upgrade tool stopped due to error, please address reported issue and re-run the tool again to finish upgrade.”

    So far, so good. I copied the VsanRealign.py script to one of the ESXi hosts and ran “precheck”. However, this fails with an error message that is not listed:

    Object UUID: 66865856-ffe9-f791-95f5-a0369f709828
    Recorded Path: /vmfs/volumes/vsan:52637182b834474d-90e47869464cba06/58865856-356e-a535-5cee-a0369f709828/IPVA TZA K100000_1.vmdk
    Recorded VM: IPVA TZA K100000
    Errors:
    2016-04-14T09:55:07.950Z VsanSparseRealign: GetExtents: DiskLib_Open() failed Disk encoding error (61)
    2016-04-14T09:55:07.950Z VsanSparseRealign: Error Closing handle: Disk encoding error (61)

    That message repeats for all other VMs.

    How can this be fixed?

    • Please ensure you are using the latest version of the script. The KB was updated recently. If the problem still exists, please contact support.

      Just one question I’d like to ask. Are you using any product, such as vCloud Director, Horizon View or App Volumes in your environment?

      • Thanks for your answer,

        however, the script looks up to date to me. I downloaded the file from KB 2144881 and KB 2144882. The filenames differ, but the script itself is identical and dated April 11th.

        Running a second time with “fixcbt” now, since one VM was reported for that issue. Maybe it was a temporary problem only.

        The “Disk encoding error (61)” was reported for 56 of the 64 VMs on that cluster, so nearly every single VM, while the VMs themselves seem to run fine.

        We also do not run the named products. Only the vcenter-appliance, the orchestrator-appliance, the vmeter-appliance, and the backup-appliance (which is shutdown currently, because of issues with the new ESXi-version in general).

        • Coming back to this issue:

          running the script with “fixcbt” seems to have fixed the one VM, but left the other 56 broken with the “Disk encoding error”.

          However, I suspect that the Python script is to blame here and maybe has no support for this particular type of VM. After cross-checking I can see that these 56 VMs are all of the same type: an old ESXi 4 VM (VM version 7) with a BusLogic controller. Maybe the script does not know about this particular version of VMs?

          Guess I will open a support ticket then, as long as you do not have a better idea.

          Regards,
          Matthias

    • Any other suggestions on this one? I am getting the same issue and error; however, we do have some App Volumes in this env. This is the exact error:

      Object UUID: 533f9854-4427-d5b2-c64b-a0369f3fe000
      Recorded Path: /vmfs/volumes/vsan:52a3194a8f84f73f-949842cb3b8f2d89/cloudvolumes/apps/systemcenter/System!20!Center!20!Apps.vmdk
      Recorded VM: cloudvolumes
      Errors:
      2016-04-20T14:06:03.268Z VsanSparseRealign: GetExtents: DiskLib_Open() failed Disk encoding error (61)
      2016-04-20T14:06:03.268Z VsanSparseRealign: Error Closing handle: Disk encoding error (61)

      Thank you,
      -Glenn

        • In our case we were told that this error is a specific problem due to “encoding=windows-1252” in the *.vmdk files. Unfortunately there seems to be no easy way to recover from this. A fix will be made in some future ESXi patch, so I guess all you can do is wait.

          Regards,
          Matthias

          • I opened my ticket and was told the same thing: it is the encoding=windows-1252 line messing things up in the VMDK. The fix is to copy the vmdk, then edit the original and change “encoding=windows-1252” to “UTF-8”. I had to do this in about a dozen files and all worked fine after that. I was told about a future fix, but there is no timeline for it.

          • Thanks for the tip. I was having the same issue, DiskLib_Open() failed Disk encoding error (61) on older cloud appvols

            My method without copying the vmdk files:

            Enable ssh on one of the hosts in the vsan cluster.
            cd /vmfs/volumes/vsanDatastore/cloudvolumes/<folder with file issues>

            for example one of mine was cd /vmfs/volumes/vsanDatastore/cloudvolumes/writable_templates

            Use vi
            vi template_uia_plus_profile.vmdk
            type i (insert mode)
            change encoding="windows-1252" to encoding="UTF-8"
            press esc
            type :wq (save changes and quit)

            type :q! (if you mess up and don’t want to save the changes)

            Do this for each file with issues with the (61) error.

            python VsanRealign.py precheck now shows no errors and I was able to start upgrading the disk version to 3

          • @Glenn: would you be so kind as to provide us with the support case number under which you raised this issue?

            Unfortunately, we have run into a deadlock with VMware support. They insist that the problem with the file encoding is related to a hardware issue, because our hard disks are not listed on the HCL (sic).

            Thank you very much.

            Matthias

          • Hi Matthias – Here is the SR: 16965940404. Engineer Fouad Sethna was excellent; he seemed to know generally about the issue and was using me as a test to see if this resolved it, for a fix in a future release.

            Good luck
            -glenn

  6. Cool tool and very much appreciate the post. Right now, seeing something not covered in the post and comments, on the last hard disk out of a bunch on an appliance platform services controller:

    -------------------------------------------------------------------
    These objects encountered an unknown error during the scanning process
    -------------------------------------------------------------------
    Object UUID: 8e2d7655-9ff4-c6fc-5ea9-ecf4bbcb6048
    Recorded Path: /vmfs/volumes/vsan:52e2598d4f92602a-2f53dd9806d62e34/862d7655-b875-7b06-4e76-ecf4bbcb6048/sles11-64-2cpu-2gb-is-vcpsc1_10.vmdk
    Recorded VM: sles11-64-2cpu-2gb-is-vcpsc1
    Errors:
    2016-05-03T16:29:23.850Z VsanSparseRealign: Failed to initialize the disk library.

    Not sure what this means, going to run the script again to see what happens. Anyone with a thought? FWIW, we run VDP to backup this VM, but we also run VDP on others in the VSAN cluster with no issue.

      • Thanks for your quick response. It did not recur in subsequent runs, and I’m currently 55% format upgrade, so — so far so good. Maybe a transient issue with the disk being temporarily locked?

  7. I’m in a position where I can move all my VMs to some alternate storage. Is it easier for me to just delete the disks / disk groups and start again (turn VSAN off and on again in the cluster maybe)?

    • This is one of those “it depends” type of questions Paul. Many customers have been able to go through the upgrade process without issues. But going the route that you suggest is also an option, if you want to do it that way and avoid “potential” issues as highlighted here.
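
Several commenters above resolved the “Disk encoding error (61)” by changing encoding="windows-1252" to encoding="UTF-8" in the affected VMDK descriptor files. With more than a handful of files, that edit could be scripted instead of done by hand in vi. The following is a minimal Python sketch, not part of VsanRealign.py or any VMware tool. The function names are hypothetical, and it assumes the descriptor is a small text file containing a line of the form encoding="windows-1252". In line with the advice to copy the vmdk first, it keeps a .bak copy before writing anything.

```python
import re
import shutil

# Assumption: the descriptor holds a line such as  encoding="windows-1252"
ENCODING_LINE = re.compile(
    r'^([ \t]*encoding[ \t]*=[ \t]*)"windows-1252"([ \t]*\r?)$',
    re.IGNORECASE | re.MULTILINE,
)


def fix_descriptor_encoding(text):
    """Return (new_text, replacements) with the encoding line set to UTF-8."""
    return ENCODING_LINE.subn(r'\g<1>"UTF-8"\g<2>', text)


def fix_descriptor_file(path):
    """Rewrite the encoding line of one descriptor file, keeping a .bak copy.

    Only intended for the small text descriptor, never for the data object.
    latin-1 maps every byte to one character, so all other bytes in the
    file round-trip unchanged.
    """
    with open(path, "r", encoding="latin-1", newline="") as handle:
        original = handle.read()
    updated, count = fix_descriptor_encoding(original)
    if count:
        shutil.copy2(path, path + ".bak")  # backup before touching the original
        with open(path, "w", encoding="latin-1", newline="") as handle:
            handle.write(updated)
    return count
```

A return value of 0 from fix_descriptor_file means the file had no matching line and was left untouched. As with the manual edit, re-running the precheck afterwards is the way to confirm whether the error has gone.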
