VSAN 6.2 Part 8 – Upgrading VSAN Stretched Cluster from 6.1 to 6.2

Cormac

8 years ago

This is an exercise that we ran through in our lab environment, and we thought the steps would be useful to share here. By way of introduction, our 4-node cluster is split into a 2+2+1 configuration: 2 ESXi hosts on site A (VLAN 4), 2 ESXi hosts on site B (VLAN 3), and a third site, site C (VLAN 80), hosting the witness appliance (a nested ESXi host). All sites are connected over L3. In other words, static routes are added to the ESXi hosts on site A so that they can reach the ESXi hosts on site B and the witness on site C. Similarly, static routes are added to the ESXi hosts on site B so that they can communicate with the ESXi hosts on site A and the witness on site C. Finally, the witness has static routes so that it can communicate with the ESXi hosts on site A and site B. The vCenter Server managing this cluster runs on a separate management cluster and does not reside on the stretched cluster.
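As a rough illustration of the routing described above, static routes on an ESXi host can be added with `esxcli network ip route ipv4 add`. The sketch below only prints the commands rather than running them. The helper name `route_cmds` and every address in it are made up for the example, since only the VLAN IDs of the lab are given here.

```shell
#!/bin/sh
# Print the esxcli commands that add static routes from one site to the others.
# All addresses are hypothetical; only the VLAN IDs come from the lab setup.

# usage: route_cmds <local-gateway> <remote-network-cidr>...
route_cmds() {
    gateway="$1"
    shift
    for network in "$@"; do
        echo "esxcli network ip route ipv4 add --network $network --gateway $gateway"
    done
}

# A site A host (VLAN 4) reaching site B (VLAN 3) and the witness site C (VLAN 80).
route_cmds 192.168.4.1 192.168.3.0/24 192.168.80.0/24
```

The same helper would be called with different arguments on the site B hosts and on the witness. Review the printed commands before pasting them into a host shell.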

In a nutshell, these are the tasks involved to avoid any downtime in the stretched cluster:

Upgrade vCenter Server to 6.0U2 (VSAN 6.2)
Upgrade Site A ESXi host 1 to ESXi 6.0U2 (VSAN 6.2)
Upgrade Site A ESXi host 2 to ESXi 6.0U2 (VSAN 6.2)
Upgrade Site B ESXi host 1 to ESXi 6.0U2 (VSAN 6.2)
Upgrade Site B ESXi host 2 to ESXi 6.0U2 (VSAN 6.2)
Upgrade Site C Witness appliance to ESXi 6.0U2 (VSAN 6.2)
Perform rolling upgrade of on-disk format across all hosts

Let’s look at these steps in more detail.

1. Upgrade vCenter Server

This is pretty straightforward. Since we typically use the vCenter appliance, we tend to use the VAMI for upgrades. This is located at http://<vcenter-ip-address>:5480. Log in with root credentials, and then select the Update view.

Click on Settings, change the repository URL to the one you use, click OK and then click Install Updates. I also tend to monitor the progress by running the following in a shell session:

tail -f /storage/log/vmware/applmgmt/software-packages.log

After the upgrade, the vCenter Server will need to be rebooted. The first step is now complete. Be aware that the health check will now be goosed, as the vCenter health check version no longer matches the ESXi server version.

Refer to the detailed vSphere documentation for further info (or if you wish to get the steps for a Windows version of vCenter Server). Next, we need to upgrade the ESXi hosts.

2. Upgrade ESXi hosts

Since this is a stretched cluster, HA and DRS will be enabled. I have DRS in fully automated mode, so that when a host is placed into maintenance mode, the VMs are automatically migrated to the remaining hosts in the same site (since we have affinity rules set up to do that as part of the stretched cluster configuration).

The other consideration is whether you fully evacuate all the components from the hosts, or whether you choose the “ensure accessibility” approach. The main consideration is the read locality behaviour. VMs in a stretched cluster are designed to read from the local image. If “ensure accessibility” is chosen, the VM may be reading from the image on the remote site while the upgrade is taking place (and of course there is now only one copy of the data available, so there is greater risk should another failure occur at the remote site during the upgrade).

Of course, if full data evacuation is chosen, you need to make sure that there is enough free space on the remaining hosts on the local site to accommodate the evacuated components. If you follow the best practice recommendation for vSphere HA and keep 50% of capacity free for failover, you should be able to do this. But once again, consider which matters more to you: getting the upgrade done as quickly as possible and accepting extra VM I/O latency for its duration, or ensuring that there is no added risk to the VMs while upgrading.
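The free-space question for a full evacuation comes down to simple arithmetic, which can be sketched as below. The helper name `can_evacuate` and the capacity figures are hypothetical, and the check is deliberately crude: it compares raw totals only and ignores slack space, per-disk limits and policy overheads.

```shell
#!/bin/sh
# Crude check: does the data on the host entering maintenance mode fit into
# the free space on the remaining hosts in the same site? Values are in GB.

# usage: can_evacuate <used-GB-on-host> <free-GB-on-remaining-local-host>...
can_evacuate() {
    used="$1"
    shift
    free_total=0
    for free in "$@"; do
        free_total=$((free_total + free))
    done
    if [ "$free_total" -ge "$used" ]; then
        echo "ok: $free_total GB free for $used GB to evacuate"
    else
        echo "short by $((used - free_total)) GB"
        return 1
    fi
}

# Hypothetical figures for a 2-host site: 800 GB used on the host being
# upgraded, 1200 GB free on the other host in that site.
can_evacuate 800 1200
```

With two hosts per site, there is only one other local host to evacuate to, so in practice this reduces to comparing two numbers.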

In my environment, I assumed that I was doing this during a maintenance window, so I used the “ensure accessibility” maintenance mode option, incurring additional latency in some cases, but getting the upgrade done asap. The steps were:

Place host in maintenance mode – ensure accessibility
Wait for DRS to migrate the VMs to other hosts on the same site
Upgrade the ESXi host from 6.0U1 to 6.0U2 – during this time, host connection / power state errors are to be expected, as well as a warning about the vSphere HA agent
Reboot the host
Exit maintenance mode
Rinse and repeat for all hosts on site A
Rinse and repeat for all hosts on site B
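For anyone doing the per-host steps above from the command line, the sequence could look something like the sketch below. This is an assumption about tooling, not a description of what was used in this lab: it assumes an offline bundle already copied to a datastore and an upgrade via `esxcli software profile update`. The helper name `upgrade_plan`, the bundle path and the image profile name are placeholders. The script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Print the host-side command sequence for upgrading one ESXi host using the
# "ensure accessibility" maintenance mode option. Nothing is executed here.

# usage: upgrade_plan <depot-zip-path> <image-profile-name>
upgrade_plan() {
    depot="$1"
    profile="$2"
    cat <<EOF
esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility
esxcli software profile update --depot $depot --profile $profile
reboot
esxcli system maintenanceMode set --enable false
EOF
}

# Placeholder bundle path and profile name; substitute the real ones.
upgrade_plan "/vmfs/volumes/datastore1/ESXi-6.0U2-offline-bundle.zip" "ESXi-6.0U2-standard"
```

Note that entering maintenance mode directly on the host does not ask DRS to move anything. If you want DRS to migrate the VMs as described in the steps above, place the host into maintenance mode from vCenter instead and only run the upgrade and reboot on the host.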

At this point all hosts are running VSAN 6.2 except for the witness. The health check still isn’t working as the witness appliance is still not upgraded. The health check reports that as follows:

Let’s take care of this next.

3. Upgrade Witness Appliance

Caution: Note that this is upgrading the witness appliance. You must upgrade it just like you upgraded the physical ESXi hosts. You cannot simply deploy a new VSAN 6.2 witness appliance at this point, because the on-disk formats are not compatible. Instead you must upgrade the ESXi version of the witness appliance using normal upgrade tools.

Since there are no VMs to migrate, and considering that no other host can contain the witness components, maintenance mode here is optional in my opinion. Simply upgrade the ESXi version and reboot the witness appliance.

Before commencing the upgrade of the witness, you should see the status of the objects as follows:

After the witness has been upgraded, the objects should look as follows:

And now the health check should be working as well:

And now we have a warning about the on-disk format being out-of-date. Let’s take care of that next. Note that upgrading of the witness appliance took about 20 minutes in my tests.

4. On-disk format upgrade

This is the final step in the upgrade process. If we wish to leverage some of the new features in VSAN 6.2, such as deduplication, compression and checksum, we need to upgrade the on-disk format. If we look at the VSAN > Manage > General View, we can see that there is a warning about the on-disk format and the option to upgrade it.

If the upgrade button is clicked, VSAN will perform a rolling upgrade of the disk groups. This involves:

A complete evacuation of all components from a disk group to other disk groups in the same site.
Removing all the disks from the disk group.
Recreating the disk group and adding the disks back in, now with the new on-disk format
Rinse and repeat for all ESXi hosts on site A
Rinse and repeat for all ESXi hosts on site B
Note that since there is no place to evacuate the witness components, and seeing as the witness appliance on-disk format must also be upgraded, the witness components are deleted and recreated. Since these components are quite small, this step is relatively quick.

The process can be tracked by viewing the tasks on the data center. Note that since the witness appliance is not part of the cluster, you will not see those tasks in the cluster task view.

When this process is completed, the upgrade to VSAN 6.2 is also completed.

Note that in my environment, which is an all-flash VSAN configuration (apart from the witness, which was deployed on a hybrid VSAN), the on-disk upgrade took the best part of 1 hour. There were only 5 VMs on each site, however, so your mileage may vary.
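One way to double-check the result from a host shell is to look at the format version that `esxcli vsan storage list` reports for each disk. The sketch below counts disks still below the target version. It assumes the output contains lines of the form `On-disk format version: <n>`, so treat that field name as an assumption and check it against your own output. The helper name `count_old_format` is made up for the example.

```shell
#!/bin/sh
# Count disks whose on-disk format version is below the target (default 3).
# Reads the output of "esxcli vsan storage list" on stdin.
# Assumes lines that look like "   On-disk format version: 3".

# usage: esxcli vsan storage list | count_old_format [target-version]
count_old_format() {
    target="${1:-3}"
    awk -F': *' -v target="$target" '
        /On-disk format version/ { if ($2 + 0 < target + 0) old++ }
        END { print old + 0 }
    '
}
```

A result of 0 on every host, including the witness, would suggest that no disk was left behind on the older format.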

5. Some notes on support and interoperability

In this section, I just want to highlight a few interop and support issues with stretched cluster. VSAN 6.2 introduces some new features such as RAID-5/RAID-6 objects, deduplication, compression and checksum. So are these supported with a VSAN 6.2 stretched cluster?

RAID-5/RAID-6: Not supported. Only RAID-1 configurations are supported for tolerating failures in a VSAN stretched cluster configuration.
Checksum: Supported, and enabled on virtual machines by default when the on-disk format is upgraded to V3.
Deduplication/Compression: Supported, but disabled by default. If this is enabled on the stretched cluster, another rolling upgrade of the disk groups is required, as per step 4 above. Note however that the cluster must be an all-flash VSAN stretched cluster. These features are not supported on hybrid.
Deduplication/Compression + Checksum: Supported.

One other item that I want to mention is something which I am really happy about. I wrote about the split brain situation here, and the manual steps administrators would have to go through to clean up ghost VMs when both sites were no longer able to communicate. We actually supplied a script in VMware Knowledge Base Article 2135952 to help you with this. In VSAN 6.2, this is no longer necessary and we actually clean up after failover, which is great news. A snippet of the failover tasks is shown below.

6. Stretched Cluster Health Check

Don’t forget to verify that everything is working at the end by running a test on the VSAN health check. There is a section of the health check that is dedicated to VSAN stretched cluster, and it checks lots of things such as witness functionality and fault domain setup, as shown below.

At this point, you’ve successfully upgraded your VSAN 6.1 stretched cluster to 6.2. Well done!