VSAN 6.2 Part 8 – Upgrading VSAN Stretched Cluster from 6.1 to 6.2

This is an exercise that we ran through in our lab environment, and we thought that the steps would be useful to share here. By way of introduction, our 4 node cluster is split into a 2+2+1 configuration, where there are 2 ESXi hosts on site A (VLAN 4), 2 ESXi hosts on site B (VLAN 3), and a third site, site C (VLAN 80), hosting the witness appliance (a nested ESXi host). All sites are connected over L3. In other words, static routes are added to each of the ESXi hosts so that the ESXi hosts on site A can reach the ESXi hosts on site B and the witness on site C. Similarly, static routes are added to the ESXi hosts on site B so that they can communicate with the ESXi hosts on site A and the witness on site C. Finally, the witness has static routes so that it can communicate with the ESXi hosts on site A and site B. The vCenter Server managing this cluster is running on a separate management cluster and does not reside on the stretched cluster.
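As a rough sketch of how those static routes look, each host gets an `esxcli network ip route` entry per remote site. The subnets and gateway addresses below are purely hypothetical examples, not the addresses from our lab:

```shell
# On each ESXi host in site A: add routes to reach the VSAN network on
# site B and the witness network on site C. Subnets and gateways here
# are examples only - substitute the values for your own environment.
esxcli network ip route ipv4 add -n 192.168.3.0/24 -g 192.168.4.1
esxcli network ip route ipv4 add -n 192.168.80.0/24 -g 192.168.4.1

# Verify the routing table afterwards
esxcli network ip route ipv4 list
```

The equivalent routes are added on the site B hosts (pointing at sites A and C) and on the witness (pointing at sites A and B).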

In a nutshell, these are the tasks involved to upgrade the stretched cluster without incurring any downtime:

  1. Upgrade vCenter Server to 6.0U2 (VSAN 6.2)
  2. Upgrade Site A ESXi host 1 to ESXi 6.0U2 (VSAN 6.2)
  3. Upgrade Site A ESXi host 2 to ESXi 6.0U2 (VSAN 6.2)
  4. Upgrade Site B ESXi host 1 to ESXi 6.0U2 (VSAN 6.2)
  5. Upgrade Site B ESXi host 2 to ESXi 6.0U2 (VSAN 6.2)
  6. Upgrade Site C Witness appliance to ESXi 6.0U2 (VSAN 6.2)
  7. Perform rolling upgrade of on-disk format across all hosts

Let’s look at these steps in more detail.

1. Upgrade vCenter Server

This is pretty straightforward, and since we typically use the vCenter appliance, we tend to use the VAMI for upgrades. This is located at http://<vcenter-ip-address>:5480. Log in with root credentials, and then select the Update view.

Click on Settings, then change the repository URL to the one you use, click OK, and then click Install Updates. I also tend to monitor the progress via a shell session by running:

tail -f /storage/log/vmware/applmgmt/software-packages.log
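If you prefer to drive the patch itself from the command line rather than the VAMI, the appliance shell has a `software-packages` utility. The exact options below are from my recollection of the vCSA patching documentation, so verify them against the docs for your exact build before relying on them:

```shell
# From the vCenter appliance shell (appliancesh). Two documented options;
# check the vCSA patching docs for your version before use.

# Option 1: patch from a patch ISO attached to the appliance's CD drive
software-packages install --iso --acceptEulas

# Option 2: stage patches from a URL repository, review, then install
software-packages stage --url
software-packages list --staged
software-packages install --staged
```

Either way, the appliance still needs a reboot afterwards, just as with the VAMI-driven upgrade.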

After the upgrade, the vCenter server will need to be rebooted. The first step is now completed. Be aware that the health check will now be goosed, as the vCenter health check version no longer matches the ESXi server version.

Refer to the detailed vSphere documentation for further information (or if you wish to get the steps for the Windows version of vCenter Server). Next, we need to upgrade the ESXi hosts.

2. Upgrade ESXi hosts

Since this is a stretched cluster, HA and DRS will be enabled. I have DRS in fully automated mode, so that when a host is placed into maintenance mode, the VMs are automatically migrated to the remaining hosts in the same site (since we have affinity rules set up to do that as part of the stretched cluster configuration).

The other consideration is whether you fully evacuate all the components from the hosts, or whether you choose the “ensure accessibility” approach. The main consideration is the read locality behaviour. VMs in a stretched cluster are designed to read from the local image. If “ensure accessibility” is chosen, the VM may be reading from the image on the remote site while the upgrade is taking place (and of course there is now only one copy of the data available, so there is greater risk should another failure occur at the remote site during the upgrade).

Of course, if full data evacuation is chosen, then you need to make sure that there is enough free space on the remaining hosts on the local site to accommodate the evacuated components. If you follow the best practice recommendation for vSphere HA and keep 50% of capacity free for failover, you should be able to do this. But once again, consider whether you are happy to incur extra VM I/O latency for the duration of the upgrade (while getting the upgrade done as quickly as possible), or whether ensuring that there is no risk to the VMs during the upgrade is more desirable.

In my environment, I assumed that I was doing this during a maintenance window, so I used the “ensure accessibility” maintenance mode option, incurring additional latency in some cases, but getting the upgrade done as quickly as possible. The steps were:

  1. Place host in maintenance mode – ensure accessibility
  2. Wait for DRS to migrate the VMs to other hosts on the same site
  3. Upgrade the ESXi host from 6.0U1 to 6.0U2 – during this time, host connection/power state errors are to be expected, as well as a warning about the vSphere HA agent
  4. Reboot the host
  5. Exit maintenance mode
  6. Rinse and repeat for all hosts on site A
  7. Rinse and repeat for all hosts on site B
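The per-host loop above can also be sketched from the ESXi command line. Update Manager is the more typical route; this is just an illustration, and the depot path and image profile name below are hypothetical examples:

```shell
# Enter maintenance mode with the "ensure accessibility" VSAN option
esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility

# Upgrade from an offline bundle; the depot path and profile name
# are examples - list the profiles in your depot to get the real name
esxcli software profile update \
    -d /vmfs/volumes/datastore1/ESXi-6.0U2-depot.zip \
    -p ESXi-6.0U2-example-standard

reboot

# After the host comes back up, exit maintenance mode
esxcli system maintenanceMode set --enable false
```

DRS then rebalances the VMs, and the same sequence is repeated on the next host in the site.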

At this point all hosts are running VSAN 6.2 except for the witness. The health check still isn’t working, as the witness appliance has not yet been upgraded; it reports the witness as running an out-of-date version.

Let’s take care of this next.

3. Upgrade Witness Appliance

Caution: Note that this is upgrading the witness appliance. You must upgrade it just like you upgraded the physical ESXi hosts. You cannot simply deploy a new VSAN 6.2 witness appliance at this point, because the on-disk formats are not compatible. Instead you must upgrade the ESXi version of the witness appliance using normal upgrade tools.

Since there are no VMs to migrate, and considering that no other host can contain the witness components, maintenance mode here is optional in my opinion. Simply upgrade the ESXi version and reboot the witness appliance.

Before commencing the upgrade of the witness, the health check reports the witness components of the objects as absent. After the witness has been upgraded, the objects should show as compliant.

And now the health check should be working as well.

The only remaining issue is a warning about the on-disk format being out-of-date. Let’s take care of that next. Note that upgrading the witness appliance took about 20 minutes in my tests.

4. On-disk format upgrade

This is the final step in the upgrade process. If we wish to leverage some of the new features in VSAN 6.2, such as deduplication, compression and checksums, we need to upgrade the on-disk format. If we look at the VSAN > Manage > General view, we can see a warning about the on-disk format and the option to upgrade it.

If the upgrade button is clicked, VSAN will perform a rolling upgrade of the disk groups. This involves:

  1. A complete evacuation of all components from a disk group to other disk groups in the same site.
  2. Removing all the disks from the disk group.
  3. Recreating the disk group and adding the disks back in, now with the new on-disk format.
  4. Rinse and repeat for all ESXi hosts on site A
  5. Rinse and repeat for all ESXi hosts on site B
  6. Note that since there is no place to evacuate the witness components, and seeing as the witness appliance on-disk format must also be upgraded, the witness components are deleted and recreated. Since these components are quite small, this step is relatively quick.
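For those who prefer RVC over the web client button, the same rolling upgrade can (as far as I recall from the 6.2 tooling) be started and monitored from the Ruby vSphere Console on the vCenter server. The `~cluster` path below is a placeholder for a cluster path you have marked in your own RVC session:

```shell
# From RVC on the vCenter server; ~cluster is a previously marked
# cluster path (example placeholder). Verify command names against
# the RVC docs for your VSAN version.
vsan.ondisk_upgrade ~cluster     # start the rolling on-disk format upgrade
vsan.upgrade_status ~cluster     # check progress of the upgrade
vsan.disks_stats ~cluster        # per-disk view, including format version
```

This is purely an alternative front-end; the evacuation/recreate behaviour described in the list above is the same either way.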

The process can be tracked by viewing the tasks on the data center. Note that since the witness appliance is not part of the cluster, you will not see those tasks in the cluster task view.

When this process is completed, the upgrade to VSAN 6.2 is also completed.

Note that in my environment, which is an all-flash VSAN configuration (apart from the witness, which was deployed as a hybrid VSAN configuration), the on-disk upgrade took the best part of an hour. There were only 5 VMs on each site, however, so your mileage may vary.

5. Some notes on support and interoperability

In this section, I just wanted to highlight a few interoperability and support considerations with stretched cluster. Obviously VSAN 6.2 introduces some new features such as RAID-5/RAID-6 objects, deduplication, compression and checksums. So are these supported with a VSAN 6.2 stretched cluster?

  • RAID-5/RAID-6: Not supported. Only RAID-1 configurations are supported for tolerating failures in a VSAN stretched cluster configuration
  • Checksum: Supported, and enabled on virtual machines by default when the on-disk format is upgraded to V3
  • Deduplication/Compression: Supported, but disabled by default. If this is enabled on the stretched cluster, another rolling upgrade is required, as per the on-disk format upgrade in section 4 above. Note however that the cluster must be an all-flash VSAN stretched cluster; these features are not supported on hybrid.
  • Deduplication/Compression + Checksum: Supported
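Once the rolling upgrade has completed, the on-disk format can be spot-checked per host with esxcli. The “Format Version” field name is from my recollection of the 6.2 output, so verify it against your build:

```shell
# On each ESXi host: list the VSAN-claimed disks and filter for the
# on-disk format field. After the rolling upgrade, each disk should
# report the new version (version 3 for VSAN 6.2).
esxcli vsan storage list | grep -i "version"
```

If any disk still reports the old version, that disk group has not yet been through the rolling upgrade.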

One other item that I want to mention is something which I am really happy about. I wrote about the split brain situation here, and the manual steps administrators would have to go through to clean up ghost VMs when both sites were no longer able to communicate. We even supplied a script in VMware Knowledge Base Article 2135952 to help you with this. In VSAN 6.2, this is no longer necessary; the ghost VMs are automatically cleaned up after failover, which is great news.

6. Stretched Cluster Health Check

Don’t forget to verify that everything is working at the end by running the VSAN health check. There is a section of the health check dedicated to VSAN stretched clusters, which checks many things such as witness functionality and fault domain setup.
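The full health summary, including the stretched cluster checks, can also be pulled from RVC; a minimal sketch, where `~cluster` is again a placeholder for a marked cluster path in your session:

```shell
# From RVC: run the full VSAN health summary for a marked cluster path.
# The stretched cluster checks appear as their own section in the output.
vsan.health.health_summary ~cluster
```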

At this point, you’ve successfully upgraded your VSAN 6.1 stretched cluster to 6.2. Well done!

12 comments
  1. Any comment from VMware on customers that purchased VSAN advanced or Horizon Enterprise (VSAN Advanced included) for stretched cluster capability and now being forced to purchase an entirely new SKU (VSAN Enterprise) to get stretched clustering with 6.2?

    • I’m pretty sure that existing customers with existing licenses and SKUs are being given special treatment/consideration. Contact your VMware account team for details.

  2. Hi Cormac,

    We are getting quite a bit of interest in Stretched Clusters, one concern I have is that you cannot have N+2.

    If you lose a complete site, then you cannot lose another disk or node – is that correct, and if it is, will there be improvements in the future (i.e. local RAID-5 either within a host (min 1 node) or across hosts (min 4 nodes))?

    Many thanks
    Mark

  3. Hi Cormac, how about an update of the VSAN Stretched Cluster Guide, adding some information and clarification on the HA advanced setting das.respectVmHostSoftAffinityRules? Currently this advanced setting is not mentioned in there. My understanding is that if DRS is enabled, soft rules will be respected by the FDM agent in an HA-enabled vSphere cluster. As we just had some intense discussions about the das.respectVmHostSoftAffinityRules setting in general, and also especially in regard to a VSAN Stretched Cluster, some clarification would be highly appreciated.
    For the next version of the VSAN Stretched Cluster Guide: add info for prospective VSAN Stretched Cluster customers with only a vSphere Standard license – they can’t use DRS, so they should definitely configure this advanced setting, I guess :). And what about those with a DRS license – is the advanced setting needed or not?
    Thanks!

    • The only settings required for a VSAN stretched cluster are covered in the guide, Steffen. You should not need to add/modify any additional rules.

      I’ll highlight the DRS comment to the new owner of the SC guide.

  4. Hello Cormac, any news about cross connect ROBO nodes? I have a customer who wants to use VSAN ROBO for their 60 Remote Offices with cross connect without a 10GbE Switch
