VSAN 6.2 Part 12 – VSAN 6.1 to 6.2 Upgrade Steps

Cormac

8 years ago

I’ve already written a few articles around this, notably on stretched cluster upgrades and on-disk format issues. In this post, I just wanted to run through the 3 distinct upgrade steps in a little more detail, and show you some useful commands that you can use to monitor the progress. In a nutshell, the steps are:

Upgrade vCenter Server to 6.0U2 (VSAN 6.2)
Upgrade ESXi hosts to ESXi 6.0U2 (VSAN 6.2)
Perform rolling upgrade of on-disk format from V2 to V3 across all hosts

Step 1: Upgrade vCenter Server

As I was using the vCenter Server Appliance, I used the VAMI interface to upgrade. I find this the simplest way of upgrading my VCSA. Simply point your browser to port 5480 on the appliance, log in as root, select Updates, and then go to Check Updates => Check URL (assuming the vCenter Server has access to VMware.com):

From there, select Install Updates followed by selecting “Install all updates”:

The progress of the update can be monitored either in the VAMI interface or via the CLI. For the CLI, open a shell session to the VCSA and navigate to /var/log/vmware/applmgmt.

From there I typically run a tail command against software-packages.log.

mgmt-vc01:/var/log/vmware/applmgmt # tail -f software-packages.log

2016-03-30T08:50:34.090 [1998]DEBUG:vmware.vherd.base.software_update:WGET: https://vapp-updates.vmware.com/vai-catalog/valm/vmw/647ee3fc-e6c6-4b06-9dc2-f295d12d135c/6.0.0.10000.latest/package-pool/ruby2.1-rubygem-mini_portile-0.6.2-27.9.x86_64.rpm

2016-03-30T08:50:35.090 [1998]DEBUG:vmware.vherd.base.software_update:WGET: https://vapp-updates.vmware.com/vai-catalog/valm/vmw/647ee3fc-e6c6-4b06-9dc2-f295d12d135c/6.0.0.10000.latest/package-pool/vmware-tools-vmci-kmp-trace-9.8.1.0_3.0.76_0.11-5.sles11.x86_64.rpm
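The raw log is quite noisy, so here is a minimal sketch of a filter that reduces each WGET line to a timestamp and the package being fetched. It assumes the line layout shown in the sample above (timestamp first, URL last); the function name wget_packages is my own and is not part of the appliance.

```shell
#!/bin/sh
# Sketch: reduce software-packages.log WGET lines to "timestamp package" pairs.
# Assumes the layout in the sample above:
#   <timestamp> [pid]DEBUG:...software_update:WGET: <url>
# wget_packages is a hypothetical helper name, not a VCSA tool.
wget_packages() {
    awk '/software_update:WGET:/ {
        n = split($NF, parts, "/")   # last field is the URL; split it on "/"
        print $1, parts[n]           # timestamp, then the package file name
    }'
}
```

It can be placed at the end of the same tail command, for example: tail -f software-packages.log | wget_packages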

Once the upgrade is complete, the VCSA will need a reboot.

Important: Make sure to close the original browser and launch a new browser to log back in to the vSphere Web Client. You may also need to clear your browser’s cache if you have difficulty logging back in, or if the layout of the vSphere Web Client login page looks a little strange.

One item to note – the health check will no longer function at this point, due to a mismatch between the version of ESXi and vCenter Server. Unfortunately, this backward compatibility could not be maintained as we needed to introduce a new high quality SDK in VSAN 6.2. However, going forward, the objective will be to maintain backward compatibility. When you examine the health check, you will see something like this:

Once the ESXi hosts are upgraded to U2 (and note that all of them will need to be upgraded), the health check will start to function once more. Cool, that is vCenter Server upgraded. Now let's focus on the ESXi hosts.

Step 2: Upgrade ESXi

There are various ways of doing this. I chose to do it via the ESXCLI as I only had a few hosts. You need to put the ESXi hosts into maintenance mode and perform the upgrade, one host at a time. I chose full evacuation of all data as I did not know how long the upgrade would take. I used the following command:

# esxcli system maintenanceMode set -e True -m evacuateAllData

Once the host had entered maintenance mode, I pointed it to the VUM online depot and upgraded (again assuming the ESXi hosts had access to VMware.com).

# build_url="https://hostupdate.vmware.com/software/VUM/\
PRODUCTION/main/vmw-depot-index.xml"

To check the contents of the depot:

# esxcli software sources profile list --depot $build_url

To initiate the upgrade:

# esxcli software profile update --depot $build_url \
--profile ESXi-6.0.0-20160302001-standard --no-sig-check

Of course, all of this could easily be scripted if you had a number of hosts to upgrade.
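As a rough illustration of what such a script might look like, here is a minimal sketch that wraps the same esxcli commands in a couple of helper functions. It makes several assumptions that are not from my setup: SSH is enabled on each host, root can log in over SSH, and the helper names (run_on, upgrade_host, exit_maintenance) are hypothetical. By default it only prints the commands, so you can review them before letting it run anything.

```shell
#!/bin/sh
# Sketch of a per-host upgrade helper. Prints commands unless DRY_RUN=0.
# Assumptions (not from the post): SSH is enabled on the hosts and root may
# log in; run_on, upgrade_host and exit_maintenance are hypothetical names.
DEPOT="https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml"
PROFILE="ESXi-6.0.0-20160302001-standard"
DRY_RUN="${DRY_RUN:-1}"

run_on() {
    # run_on <host> <command...>: echo in dry-run mode, otherwise run over ssh
    host="$1"; shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "[$host] $*"
    else
        ssh "root@$host" "$*"
    fi
}

upgrade_host() {
    # Same three actions as the manual procedure: evacuate, update, reboot.
    run_on "$1" esxcli system maintenanceMode set -e True -m evacuateAllData || return 1
    run_on "$1" esxcli software profile update --depot "$DEPOT" \
        --profile "$PROFILE" --no-sig-check || return 1
    run_on "$1" reboot
}

exit_maintenance() {
    # Run this once the host is back up after its reboot.
    run_on "$1" esxcli system maintenanceMode set -e False
}
```

With the functions loaded, upgrade_host esxi-hp-01.rainpole.com prints the three commands for that host. The sketch deliberately does not loop over hosts on its own: waiting for a host to come back, and for any resync to finish before moving to the next host, is left to the operator.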

Once the upgrade has completed, reboot the host and exit maintenance mode. You can exit maintenance mode with the command:

# esxcli system maintenanceMode set -e False

Now rinse and repeat this operation on the remaining hosts in the cluster. While this operation is going on, either the UI or RVC can be used to monitor evacuation/rebuild activities. I tend to like the RVC command, which is available on the vCenter Server.

/mgmt-vc01/mgmt-datacentre/computers> vsan.resync_dashboard -r 10 0
2016-03-30 10:43:03 +0000: Querying all VMs on VSAN ...
2016-03-30 10:43:03 +0000: Querying all objects in the system from esxi-hp-01.rainpole.com ...
2016-03-30 10:43:03 +0000: Got all the info, computing table ...
+--------------------------------------------------------------------------------------+-----------------+---------------+
| VM/Object                                                                            | Syncing objects | Bytes to sync |
+--------------------------------------------------------------------------------------+-----------------+---------------+
| vrops-01.rainpole.com                                                                | 1               |               |
|    [vsanDatastore] 3c3e2656-0482-2df7-fe44-a0369f56e350/vrops-01.rainpole.com_1.vmdk |                 | 14.10 GB      |
| vdp-01.rainpole.com                                                                  | 1               |               |
|    [vsanDatastore] 022d4f56-1ab1-2616-1b25-a0369f56deac/vdp-01.rainpole.com.vmdk     |                 | 71.11 GB      |
| por-jump-04                                                                          | 1               |               |
|    [vsanDatastore] 2bcf3956-146e-75f6-5aa1-a0369f56de98/por-jump-04.vmdk             |                 | 14.59 GB      |
+--------------------------------------------------------------------------------------+-----------------+---------------+
| Total                                                                                | 3               | 99.80 GB      |
+--------------------------------------------------------------------------------------+-----------------+---------------+
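If you capture the dashboard output to a file, the per-object rows can be summed to get a single number for logging, or to cross-check the Total row. This is only a sketch: it assumes the three-column table layout shown above and that every value is reported in GB, and resync_total_gb is a name I made up for the example.

```shell
#!/bin/sh
# Sketch: sum the "Bytes to sync" column of saved vsan.resync_dashboard output.
# Assumes the table layout shown above and that all values are in GB;
# resync_total_gb is a hypothetical helper name.
resync_total_gb() {
    awk -F'|' '
        $2 ~ /Total/ { next }            # skip the summary row
        $4 ~ /[0-9.]+ GB/ {              # object rows carry "<n> GB" in column 3
            split($4, f, " ")            # f[1] is the number, f[2] the unit
            sum += f[1]
        }
        END { printf "%.2f\n", sum }'
}
```

For the table above, the three object rows add up to the same figure as the Total row.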

At this point everything is upgraded from a vSphere perspective. All of your VMs should still be running, and health check should only be complaining about the on-disk format being a V2 rather than V3. A warning will also be shown in the General VSAN View:

All 12 of my disks still have on-disk format V2. The recommendation is to upgrade them to V3. Now we are ready for the final step, the on-disk format upgrade.

Step 3: Upgrade on-disk format to V3

Before we embark on this step, I want to reiterate something: Do not simply evacuate a disk group, remove it and recreate it. This will cause a mismatch between the existing disk group versions (V2) and the new disk group versions that you just created (V3). These cannot be used concurrently by VSAN. Administrators must use the UI for the on-disk format upgrade, and click on the Upgrade button as shown in the General VSAN View above.

There are a few sub-steps involved in the on-disk format upgrade. First, there is the realignment of all objects to a 1MB address space. Next, all vsanSparse objects (typically used by snapshots) are aligned to a 4KB boundary. This brings all objects to version 2.5 (an interim version) and readies them for the on-disk format upgrade to V3. Finally, there is the evacuation of components from a disk group, then the deletion of said disk group, and finally the recreation of the disk group as V3. This process is then repeated for each disk group in the cluster, until all disks are at V3.

While the first two sub-steps (the two alignment passes) are in progress, there is very little to see in the UI. In fact, it may appear that the upgrade task is stuck at 10% for the longest time while the alignment takes place. Another very useful RVC command for monitoring progress is vsan.upgrade_status, which here is set to refresh every 60 seconds:

/mgmt-vc01/mgmt-datacentre/computers> vsan.upgrade_status 0 -r 60
2016-03-30 14:19:59 +0000: Showing upgrade status every 60 seconds. Ctrl + c to stop.
2016-03-30 14:19:59 +0000: No upgrade in progress
2016-03-30 14:20:00 +0000: 63 objects in which will need realignment process
2016-03-30 14:20:00 +0000: 0 objects with new alignment
2016-03-30 14:20:00 +0000: 0 objects ready for v3 features
.
.
2016-03-31 09:37:41 +0000: No upgrade in progress
2016-03-31 09:37:43 +0000: 31 objects in which will need realignment process
2016-03-31 09:37:43 +0000: 32 objects with new alignment
2016-03-31 09:37:43 +0000: 0 objects ready for v3 features
.
.
2016-03-31 09:38:43 +0000: No upgrade in progress
2016-03-31 09:40:47 +0000: 0 objects in which will need realignment process
2016-03-31 09:40:47 +0000: 63 objects with new alignment
2016-03-31 09:40:47 +0000: 0 objects ready for v3 features
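Since the alignment phase can run for many hours, it can help to boil each status block down to a single line. The following is a sketch only: it assumes the three counter lines appear in the order shown above, and that the three counts are disjoint (which the sample output suggests, as they always add up to 63). The name alignment_progress is hypothetical.

```shell
#!/bin/sh
# Sketch: summarise saved vsan.upgrade_status output, one line per status block.
# Assumes the line wording and order shown above, and that the three counters
# are disjoint. alignment_progress is a hypothetical helper name.
alignment_progress() {
    awk '
        /need realignment process/      { pending = $4 }
        /objects with new alignment/    { aligned = $4 }
        /objects ready for v3 features/ {
            v3 = $4
            total = pending + aligned + v3
            if (total > 0)
                printf "%s %s aligned=%d/%d v3=%d/%d\n", \
                       $1, $2, aligned + v3, total, v3, total
        }'
}
```

Fed the second status block above, this prints the timestamp followed by aligned=32/63 v3=0/63.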

Once all objects are aligned, the same RVC command can then be used to monitor the evacuation/rebuild activity:

2016-03-31 11:02:49 +0000: Upgrade in progress - 35%
2016-03-31 11:02:51 +0000: Updating objects to alignment
2016-03-31 11:02:51 +0000: 0 objects in which need realignment process
2016-03-31 11:02:51 +0000: 63 objects with new alignment
2016-03-31 11:02:51 +0000: 0 objects ready for v3 features
2016-03-31 11:02:51 +0000: Upgrade invovles resyncing objects at times, showing current resync progress
2016-03-31 11:02:51 +0000: Querying all VMs on VSAN ...
2016-03-31 11:02:51 +0000: Querying all objects in the system from esxi-hp-01.rainpole.com ...
2016-03-31 11:02:52 +0000: Got all the info, computing table ...
+--------------------------------------------------------------------------+-----------------+---------------+
| VM/Object                                                                | Syncing objects | Bytes to sync |
+--------------------------------------------------------------------------+-----------------+---------------+
| vsan-dev-01                                                              | 1               |               |
|    [vsanDatastore] a767ea56-e820-327a-ff67-a0369f56de98/vsan-dev-01.vmdk |                 | 31.75 GB      |
| vVNX                                                                     | 1               |               |
|    [vsanDatastore] 6b243656-1cdc-9207-a064-a0369f56de98/vVNX_1.vmdk      |                 | 3.37 GB       |
| ch-jump-02                                                               | 1               |               |
|    [vsanDatastore] 74fafb56-126b-2f41-af97-a0369f56e350/ch-jump-02.vmdk  |                 | 31.58 GB      |
| Unassociated                                                             | 1               |               |
|    d4265356-ce26-b22d-17f1-a0369f56e350                                  |                 | 10.48 GB      |
+--------------------------------------------------------------------------+-----------------+---------------+
| Total                                                                    | 4               | 77.18 GB      |
+--------------------------------------------------------------------------+-----------------+---------------+

Eventually, all disk groups should be upgraded to on-disk format V3:

2016-04-04 09:26:40 +0000: No upgrade in progress
2016-04-04 09:26:41 +0000: 0 objects in which will need realignment process
2016-04-04 09:26:41 +0000: 0 objects with new alignment
2016-04-04 09:26:41 +0000: 63 objects ready for v3 features

Important: If you do not have enough resources, for example if you have a 3-node cluster, then you are unable to evacuate the disk groups. There are simply not enough resources to maintain the protection of the VMs. If you attempt an upgrade on a cluster with insufficient resources, it will fail with a message similar to:

A general system error occurred: Failed to evacuate data for disk uuid <XXXXX> with error: Out of resources to complete the operation

In this case, as per the VSAN 6.2 release notes, customers will have to use the RVC command:

> vsan.ondisk_upgrade --allow-reduced-redundancy

This will allow you to upgrade the on-disk format but with reduced redundancy. In other words, you may be running with only one copy of the data while the on-disk format is upgraded. This is why we always recommend a minimum of 4 nodes for VSAN; it allows you to avoid any potential risks with maintenance or upgrades, and also allows VSAN to self-heal in the event of a failure.

There is another issue to bring to your attention: the realignment of objects that takes place before the upgrade may itself fail, with a message similar to:

Failed to realign following Virtual SAN objects: <list of UUIDs> due to being locked or lack of vmdk descriptor file, which requires manual fix

Failed to migrate vsanSparse objects on cluster

In this case, we have instructions on how to resolve this situation as per this earlier post.

That now completes the third step in upgrading Virtual SAN to version 6.2. You are now ready to use the new features that come with the new on-disk format, such as compression, deduplication and software checksum.