DR of VMware vCenter Orchestrator

Over the past month or so, I’ve been looking at disaster recovery of some of the vCloud Suite components. My experiences of using vSphere Replication and Site Recovery Manager to protect and recover vCenter Operations Manager in the event of a disaster can be found here and here. Now it was time to look at vCenter Orchestrator (vCO) to see if that could be protected and recovered.

In this configuration, I deployed vCO in HA mode, meaning that there were two vCenter Orchestrator servers, one running and one in standby mode. The database for vCO was an external SQL Server database, running in its own VM. So there were three VMs to protect in this setup.

I am not going to repeat the detailed steps to set up vSphere Replication (VR) or Site Recovery Manager (SRM). Suffice it to say that all the usual steps were required: pairing the sites, mapping resources, and creating the protection group and recovery plan. One point to note, however, is that the SQL Server database VM can use VSS (Volume Shadow Copy Service) for application quiescing when it is replicated. I used that setting in this test. The vCO servers do not have any quiescing abilities.

Like before, I used a ‘swing’ vCenter server so that this could also be failed over with the vCO configuration.

When it came to the recovery plan, the “configure recovery” step was necessary on a per-VM basis to do the IP customization of the VMs. I also added dependencies between the VMs, so that the standby vCO server had a dependency on the running vCO server being ready, and the running vCO server had a dependency on the SQL Server database VM being ready. Finally, I added a prompt step to the SQL Server DB start-up, which gave me an opportunity to modify the IPAM entries of my VMs and update DNS with their new IP addresses on the recovery site. This is what the recovery plan looked like:

[Screenshot: recovery plan]

I also ensured that when the vCO servers were being set up, they referenced the SQL Server database by fully qualified domain name (FQDN) and not by IP address. I also ensured that the FQDN of the vCenter server was used when connecting to it from the vCO server:

[Screenshot: connect to vCenter]

I connected my vCO server to my vCenter server, and used the vCO client to verify that I could log in and run some simple workflows. Everything worked as expected. It was now time to test the failover/DR scenario.
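The start-up dependencies configured in the recovery plan earlier form a simple chain, and it can help to write that chain down before testing a failover. The following is only an illustrative sketch: SRM enforces the ordering itself, and the VM names used here are hypothetical placeholders, not the names from my environment.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical VM names. Each key maps to the set of VMs that must be
# ready before that VM is started.
dependencies = {
    "sql-db": set(),                 # database VM has no prerequisites
    "vco-running": {"sql-db"},       # running vCO node waits for the database
    "vco-standby": {"vco-running"},  # standby vCO node waits for the running node
}


def startup_order(deps):
    """Return a power-on order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


print(startup_order(dependencies))
```

Because each VM depends on exactly one predecessor, there is only one valid order: database first, then the running vCO node, then the standby node.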

Both my vCO servers and the SQL DB were eventually brought up on my recovery site, once the IP customization had run and the prompt to change the IPAM and DNS entries for my VMs had been dismissed. At this point, my vCO environment was up and running with new IP addresses on the recovery site, so now it was time to validate that everything was working as expected. The first step was to log in to the vCO Configuration UI on each server and ensure all was okay. (If you get some strange login messages from the client, check this out.) When I logged in, I saw that Network and Startup Options were both flagged:

[Screenshot: VCO1 view]

The network issue was due to the fact that the original IP address from the protected site was still configured. I changed this manually in the Network section of the Configuration UI by selecting the new IP address (the IP address of the vCO server on the recovery site). After doing so, all configuration entries went into a green state. I repeated this process on the other vCO server in the HA pair, and it also went green.

However, I still could not log in to either of the vCO servers via the client. This was because the Server Availability section had no nodes. I went to the Startup Options and restarted the services, after which both nodes appeared under Server Availability. The strange thing was that both servers showed as Running, whereas one should be Standby:

[Screenshot: cluster mode, both nodes running]

I eventually switched the nodes back into Standalone mode, and then flipped them back into cluster mode, and this seemed to resolve the issue:

[Screenshot: standby mode back]

This only happened on one occasion. On another failover attempt, the nodes came up in the correct state (one running, one standby). However, you should definitely only have one vCO server in a Running state, so verify this and rectify it before proceeding.
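The rule above (exactly one Running node, the rest Standby) is easy to express as a small check. This is a minimal sketch that assumes you have copied the node states by hand from the Server Availability section. The node names are hypothetical, and the function does not query vCO in any way.

```python
def check_ha_states(states):
    """Return a list of problems; an empty list means the HA pair looks healthy.

    `states` maps a node name to the state shown under Server Availability,
    for example "Running" or "Standby".
    """
    problems = []
    if not states:
        problems.append("no nodes listed under Server Availability")
        return problems

    running = sorted(name for name, state in states.items() if state == "Running")
    if len(running) != 1:
        problems.append(
            f"expected exactly 1 Running node, found {len(running)}: {running}"
        )

    # Any node that is neither Running nor Standby is unexpected in this setup.
    unexpected = sorted(
        name for name, state in states.items()
        if state not in ("Running", "Standby")
    )
    if unexpected:
        problems.append(f"nodes in an unexpected state: {unexpected}")
    return problems


# Hypothetical node names, reproducing the situation seen after the first failover.
print(check_ha_states({"vco1": "Running", "vco2": "Running"}))
```

An empty result corresponds to the healthy case of one Running and one Standby node. Anything else is a cue to try the Standalone and cluster mode switch described above.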

At this point I was able to successfully log into the vCO client, and add the vCO server back to my vCenter server.

To conclude, you can certainly do DR of your vCO environment using vSphere Replication and SRM, even with vCO configured in HA mode and using an external SQL Server database. After failover, however, you will have to fix the network settings in the vCO Configuration UI on each server, and possibly switch the vCO servers out of and back into cluster mode.

Note that I also did this test with the vCO servers referencing the SQL server database using IP address instead of FQDN. In that case, I had to change the vCO Configuration database settings on both servers and update the IP address to the new IP address on the recovery site. Once I did that, vCO once again appeared to work. However you can avoid this step by using FQDN like I mentioned earlier.