Disaster/Recovery (DR) of vCenter Operations Manager

I just spent a very useful week looking at how our customers might be able to protect vCenter Operations Manager (vCops) with VMware’s vSphere Replication (vR) and Site Recovery Manager (SRM) products. It was quite tricky to get this to work, if I’m perfectly honest, but that was the whole point of the exercise. What we learnt is being fed back to the various business units within VMware, to see if we can make this more intuitive and less complex to achieve, but if you are interested in knowing how to configure your DR infrastructure to protect vCops, please read on.

Before we start, you need to ensure that you are using vCops version 5.8.2. We spun our wheels trying to figure out why the vCops UI VM would not come online on the DR site, and it seems we ran into a bug with VMware Studio, which was in turn used to build vCops version 5.8.1. So make sure you have the latest and greatest version of vCops before you start. For completeness, ESXi, the vCenter Server appliance, SRM and vR were all running the latest builds (5.5) too.

Step 1: Lab Configuration

To allow the replication and orchestrated failover of vCloud Suite components, our lab configuration can be thought of as having an inner layer and an outer layer. The outer layer consists of a vCenter Server, SRM and vSphere Replication, deployed at both sites. The inner layer consists of our production vCloud Suite components; in this first test case, that is a vCenter Server appliance and vCops. It can be thought of as something like this:

[Image: Lab DR Setup]

As you can see, it is the inner vCenter/vCops configuration that we are interested in failing over. This inner vCenter server has visibility across sites. Now, this isn’t the only possible configuration. There may be other configurations that can be set up to allow for the DR of vCops. However, the purpose of this exercise is purely to see if vCops can be protected, so having a single instance of vCops in Miami replicated to Amsterdam with a single vCenter server (which may also be failed over) is the simplest way of achieving this.

One other point to note is that we were using VSAN at both the source and destination sites. Since vSphere Replication is storage agnostic, we could just as easily have used local storage. However, we also wanted to put VSAN through its paces in this DR scenario. The inner vCenter server to which vCops is registered can see the VSAN clusters in both the Miami and Amsterdam datacenters.

Step 2: vCops Deployment Considerations on the PROD site

There are two items to call out with the initial vCops deployment which will make your life a lot easier when it comes to DR. The first is to make sure that vCops is using DNS rather than IP addresses. This way, when you fail over to the DR site, you can continue to use the same fully qualified domain name (FQDN) of the UI VM rather than having to use a new IP address. This is done by following the instructions in KB article 2017835. After making the change, you will need to unregister vCops from vCenter if you have already registered it, and then re-register it using the FQDN. The second is that, during the registration, you should also reference the Analytics VM through its DNS name (as shown below), even though the IP address is presented in the registration screen. If you leave it at the IP address, we have observed that you will have to register vCops with vCenter again post failover; if you have the Analytics VM registered with its FQDN, re-registration after failover is not necessary.

[Image: repair using DNS]

Step 3: DR Site Preparation

This is where things become a little bit tricky. Since neither SRM nor vR understands the concept of a vApp, you will need to deploy vCops on the DR site. You do not need to configure it or register it with a vCenter server; you simply need to deploy it to get a vApp construct on the DR site to which vCops can fail over. Of course, to deploy vCops, you also need to have a valid IP pool, so ensure that the IP pool is correctly configured on both the PROD site and the DR site (each referencing the network settings for their appropriate site). Once the vApp has been deployed, and the Analytics and UI VMs have been correctly configured with networking for the DR site, the Analytics and UI VMs can be deleted, leaving behind the vApp.

TIP: Before deleting the vCops VMs, you can power up the vApp at the DR site, examine the vApp properties and values and verify that the values for the vApp properties fill in correctly. Here is an example of the sorts of properties and values you should see when the vCops vApp is powered on:

[Image: vApp VM properties]

Once you have verified that the values match the desired network settings for vCops at the DR site, you can power the vApp down once more and delete both VMs, leaving the vApp construct in place.

Step 4: Setup Replication

At this point, vCops is successfully deployed and running at the PROD site, and a vApp container with no VMs is deployed at the DR site. We can now go ahead and set up the replication and orchestration necessary for a successful failover of vCops. We will make the assumption that the sites are already paired, so it is simply a matter of selecting the individual VMs in the vCops container, selecting vSphere Replication from the drop-down list, selecting a replication server on the DR site, selecting a destination datastore on the DR site, and selecting a suitable Recovery Point Objective (RPO) and point in time instances if required. That’s it. When both VMs from the vCops vApp are replicating, you can start on the SRM part of the setup.

TIP: When using vSphere Replication and SRM, you will only be able to select VMs to add to a Protection Group when they are replicating or are fully replicated.

Step 5: Mappings & Placeholder Datastore

In SRM, there are three mappings tabs: resources, network and folders. In each of these tabs, select a mapping between constructs on the PROD site and DR site. The most important of these for vCops is the vApp construct – ensure that the vApp on the PROD site is mapped to the vApp on the DR site under the Resource Mappings tab, as shown below:

[Image: resource mappings]

You can also go ahead and select a placeholder datastore on the DR site. Since we have VSAN on both sites, we used this as our placeholder.

Step 6: Protection Groups 

Once all the mappings are in place, it is time to build a Protection Group. In this protection group, we place both vCops VMs (Analytics & UI). The protection group type will be vSphere Replication and not array-based replication (ABR). ABR leverages the native replication of storage arrays, which we do not have in this setup. One thing to note with the vCops vApp is that it configures the VMs with CD/DVDs attached via datastore (no idea why), which means that they cannot be replicated in this state. Therefore the Protection Group creation workflow will throw an error “Unable to protect VM due to unresolved devices”. You will now have to go into each VM in the Protection Group, edit its properties and detach the CD/DVD drive. The Protection Group status should now turn OK.

TIP: Since this is a vApp property, even if you change the VMs to use a CD/DVD type other than datastore, rebooting the vApp will set them back to datastore and break your protection group. We know this is silly, so we’re trying to see if it can be changed.

Step 7: Recovery Plans

The next step is to create the recovery plan. During this step, you simply pick the Protection Group that you already created. One thing to note which might help with the vCops tests: during a test failover, we chose to bring our vCops vApp live onto the network, so instead of leaving the test VM network on auto (bubble network), we provided an actual port group in the Recovery Plan. This allows us to verify vCops vApp functionality after a test failover.

A few adjustments from the defaults are needed here. The first thing to note is that no customization of the IP settings is needed. The VMs on the DR site will inherit their networking from the vApp on the DR site, which you’ve already set up. However, there are some additional changes needed.

7.1 Analytics VM changes

The only change needed in the recovery plan of the Analytics VM is a prompt in the Pre-Power On steps. This will give you an opportunity to change the IP addresses in your DNS records, as well as put in place the vApp VM variables of the vCops VMs on the DR site. More on this later.

7.2 UI VM changes

The UI VM needs a few changes in the Recovery Plan. The first is to put a dependency on the Analytics VM in the VM Dependencies section. This means that the UI VM will wait for the Analytics VM to be active before starting. Next, the VMware Tools timeout value should be increased to 30 minutes in Startup Action to allow the failed over vCops UI VM enough time to successfully restart. Finally, add a Power On Step to run a command on the Recovered VM which repairs the communication between the Analytics and UI VMs. The command is sudo -u admin /usr/local/bin/vcops-admin repair --ipaddress followed by the IP address of the Analytics VM, and the timeout should be set to 10 minutes.
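
Since the repair command has to carry the right Analytics VM address, it can help to generate the command line rather than type it by hand. What follows is only a minimal sketch in Python: the path and the --ipaddress flag come from the command above, while the function name build_repair_command, the sample address (taken from the 192.0.2.0/24 documentation range) and the assumption that the address goes directly after the flag are mine, for illustration only.

```python
import ipaddress
import shlex

# Path and flag are taken from the post. Placing the address directly
# after --ipaddress is an assumption, not confirmed syntax.
VCOPS_ADMIN = "/usr/local/bin/vcops-admin"


def build_repair_command(analytics_ip: str) -> str:
    """Return the repair command line for the given Analytics VM address."""
    # ip_address() raises ValueError if the string is not a valid address,
    # which catches typos before the command ever reaches the recovery plan.
    addr = ipaddress.ip_address(analytics_ip)
    argv = ["sudo", "-u", "admin", VCOPS_ADMIN,
            "repair", "--ipaddress", str(addr)]
    return shlex.join(argv)


if __name__ == "__main__":
    # Hypothetical DR-site address of the Analytics VM.
    print(build_repair_command("192.0.2.10"))
```

The resulting string is what you would paste into the command step of the UI VM in the Recovery Plan. Validating the address first means a mistyped IP fails here rather than 10 minutes into a failover.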

[Image: repair script]

Step 8: Test Failover – Things to do when Prompt reached…

Now everything is ready for our test failover. You will need to verify that the VMs have completed replicating of course. Now, when you initiate a test failover, the VMs on the DR site will be stood up, but this has no impact on the VMs running on the PROD site. This is partly the reason why we put the prompt/pause step in the Recovery Plan in step 7. When the Recovery Plan stops at this prompt step, the following actions need to be carried out.

  1. If this is indeed a test failover as opposed to an actual failover, edit the properties of the Analytics and UI VMs on the PROD site and remove their network connections. This will prevent any confusion with vCops on the DR site. This would not be necessary in an actual failover, since SRM will power down the VMs on the PROD site if it can still reach them.
  2. Change the DNS records for the Analytics & UI VMs to reflect the new IP addresses. We used an IPAM system called Infoblox to do this. This could also be scripted if your IPAM system provides an API.
  3. Verify that the vApp settings on the DR site, especially the start-up order and timeout values, match the vApp on the PROD site. This should only be necessary the very first time you try to fail over, and should remain in place for subsequent test failovers and actual failovers.
  4. Update the VM vApp variables. This is the most tedious part. The VM vApp variables do not get replicated by SRM/vR. Therefore you will need to edit both the Analytics VM’s vApp properties and the UI VM’s vApp properties on the DR site, and put these variables back in place. When they are back in place, and the vApp is powered on, the VMs will retrieve actual values from the vApp. This part is a real pain, but I’m sure it could also be scripted, maybe via PowerCLI, and then called out at this point in the Recovery Plan. Again, this should only be necessary the very first time you try to fail over, and should remain in place for subsequent test failovers and actual failovers.
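
Item 2 above is easy to get wrong, so before resuming the recovery plan it may be worth checking that the FQDNs really do resolve to the DR addresses. Here is a minimal sketch in Python using only the standard library. The function name check_dns and the injectable resolve parameter are hypothetical, and socket.gethostbyname only returns a single IPv4 address, so treat this as a starting point rather than a complete check.

```python
import socket
from typing import Callable, Dict


def check_dns(expected: Dict[str, str],
              resolve: Callable[[str], str] = socket.gethostbyname
              ) -> Dict[str, str]:
    """Return {fqdn: problem} for every name that does not resolve as expected.

    `expected` maps each FQDN to the IP address it should have after failover.
    `resolve` can be swapped out, e.g. for testing without a live DNS server.
    """
    problems = {}
    for fqdn, want_ip in expected.items():
        try:
            got_ip = resolve(fqdn)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems[fqdn] = f"lookup failed: {exc}"
            continue
        if got_ip != want_ip:
            problems[fqdn] = f"resolves to {got_ip}, expected {want_ip}"
    return problems
```

For example, check_dns({"vcops-ui.example.com": "192.0.2.21"}) (a made-up name and address) returns an empty dictionary when the name resolves to that address. Bear in mind that a cached record on the machine running the check can still return the old PROD address until its TTL expires.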

These are the changes needed to each VM’s vApp properties. Let’s start with the Analytics VM. Edit Settings and navigate to Options > vApp Options. Change from Disabled to Enabled. This can be done either from the C# client or the web client. I’ll show a mixture of both here:

[Image: vApp options]

Once Enabled, go to the OVF Settings, and select both ISO and VMware Tools in the OVF environment transport:

[Image: OVF Environment]

Now click on Product, and add the following:

Product Name: VMware vCenter Operations Manager
Version: 5.8.2.0
Full Version: 5.8.2.0
Product URL: 
Vendor: VMware, Inc.
Vendor URL: http://www.vmware.com
Application URL: https://${vm.ip}/

It should look something like this:

[Image: OVF Product]

And finally, there are two properties to add. Click on Properties, and add two new properties, vm.vmname and vm.ip:

Label: vmname
Key/Class ID: vm
Key/ID: vmname
Type: string
Default Value: VM_2
User Configurable: 

Label: ip
Key/Class ID: vm
Key/ID: ip
Type: ${vami.ip0.VM_2}

The UI VM is almost the same as the Analytics VM, except for the properties section. Use VM_1 in the Default Value and Type fields instead of VM_2:

Label: vmname
Key/Class ID: vm
Key/ID: vmname
Type: string
Default Value: VM_1
User Configurable: 

Label: ip
Key/Class ID: vm
Key/ID: ip
Type: ${vami.ip0.VM_1}

When they have been added, they should look something like this in the VM’s vApp Properties:

[Image: VM vApp properties]

TIP: This information is already available on the PROD site, so flip back and forth between the two sites and verify that what you are adding to the VM’s vApp properties is correct. Note that you cannot add this information to the VMs on the DR site when they are just placeholders. You must initiate the failover and bring the VMs online on the DR site before this information can be added, which is why we do it at the prompt, as the VMs are available on the DR site at this point. Again, this is something which lends itself to a call-out script, maybe PowerCLI.
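
The flipping back and forth between sites can be reduced to a simple comparison once the property values have been written down. The sketch below, in Python, does not talk to vCenter at all: it just encodes the two properties listed above for a given VM (VM_2 for Analytics, VM_1 for UI) and reports any differences between a PROD copy and a DR copy. The function names and the dictionary layout are my own invention for illustration; how you collect the actual values (by hand, or via something like PowerCLI) is left open.

```python
from typing import Dict, List

PropertyMap = Dict[str, Dict[str, str]]


def expected_properties(vm_id: str) -> PropertyMap:
    """The two VM vApp properties from the post, for VM_1 (UI) or VM_2 (Analytics)."""
    return {
        "vm.vmname": {"label": "vmname", "class_id": "vm", "id": "vmname",
                      "type": "string", "default": vm_id},
        # The Type field is reproduced exactly as listed in the post.
        "vm.ip": {"label": "ip", "class_id": "vm", "id": "ip",
                  "type": "${vami.ip0." + vm_id + "}"},
    }


def diff_properties(prod: PropertyMap, dr: PropertyMap) -> List[str]:
    """List human-readable differences between the PROD and DR property maps."""
    issues = []
    for key in sorted(set(prod) | set(dr)):
        if key not in dr:
            issues.append(f"{key}: missing on DR")
        elif key not in prod:
            issues.append(f"{key}: present on DR only")
        else:
            for field in sorted(set(prod[key]) | set(dr[key])):
                p, d = prod[key].get(field), dr[key].get(field)
                if p != d:
                    issues.append(f"{key}.{field}: PROD={p!r} DR={d!r}")
    return issues


if __name__ == "__main__":
    prod = expected_properties("VM_2")   # Analytics VM as configured on PROD
    dr = expected_properties("VM_2")
    del dr["vm.ip"]                      # simulate a property not yet re-added on DR
    for line in diff_properties(prod, dr):
        print(line)
```

An empty result means the DR copy matches what was written down for PROD. Anything else is a list of exactly which property or field still needs fixing before the failover is resumed.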

Step 9: Resume Test Failover

Now that we have disconnected the VMs on the PROD site, updated our DNS records and enabled and set the VM’s vApp properties at the DR site, the failover can be resumed. Note that the above step for fixing up the vApp is needed only once. Subsequent failovers and failbacks should not need you to do anything at this point.

Once the failover was in progress, we opened the consoles on both the UI VM and Analytics VM at the DR site to monitor progress. All going well, the repair script should run in a timely fashion, and both VMs should boot successfully. SRM should report a successful test failover:

[Image: Successful Failover]

Step 10: Post Test Failover Verification

At this point we are now in a position to test vCops at the DR site. Note that my vCenter server to which vCops is registered was not failed over in this attempt; it is still active on the PROD site, but I can still reach it from the DR site.

Also note that I did not configure a test/bubble network; instead, my test failover network is using an actual routable network, so that my vCops VMs are live. This is the reason I removed the network connections from the PROD site vCops VMs. Finally, I am using DNS throughout, and do not need to re-register vCops against my vCenter server after failover. If you are not using DNS as per the configuration steps called out at the very start, then we have observed that a re-registration is necessary.

Summary

This is certainly one way to provide DR for your vCops product. Although some of the initial configuration steps are indeed tedious (the building out of the VM’s vApp properties for one), they are one-off configuration steps and will not need to be repeated.

Special thanks go to Paudie O’Riordan and Brandon Gordon for their assistance with this endeavor.

Misc.

There is a KB article (2031891) which details how to set up vCops DR without a vApp construct at the DR site. Instead, it details how to run through this setup using a resource pool. The major difference with this approach is that since there is no vApp construct, IP customization is required on the VMs on the DR site. Therefore you will have to add IP customization in the SRM recovery plan. This increases the failover time (another reboot is required for IP customization), and also causes a lot of errors to appear in the VMs’ boot sequence since they are “Unable to find the OVF environment”. There may also be additional considerations with this approach when it comes to future upgrades of vCops. I have asked our Management BU for some guidance around this, and whether that would be an acceptable workaround. I’ll let you know when I hear back. It would save on a lot of the VM’s vApp settings if that was the case, but there could be other ramifications if there is no vApp construct.

Also, with a fully configured DNS for vCops, we’re seeing UI & Analytics VMs successfully communicating after failover without the need to run the repair script. We’ll continue to validate this, and follow up with you if it proves to be true. For now, keep the repair script in the recovery plan as it doesn’t do any harm to have it run anyway.

8 Replies to “Disaster/Recovery (DR) of vCenter Operations Manager”

  1. Great post Cormac. It was great working with you on this. There are a few comments I would like to add.

    * Since vCOps has a dependency on IP Pools and the portgroup is stored in the vApp container OVF properties, the portgroup used for a test and failover needs to be the same portgroup. Allowing SRM to use different portgroups for test and failover will break the vApp container and therefore impact recovery.

    * Repairing the vApp options on the 2 VMs can be done before performing the SRM test. When vApp options are configured on the SRM placeholder VM they are retained on the recovered VM during a test and failover. By making the changes to the placeholder VMs prior to the test, the recovery plan can run without any manual intervention.

    * Since this procedure results in a vCOps instance in the recovery site with an intact vApp container, the same process works for failback with very minimal configuration changes. All that is needed after the failback process is run is to update the repair script on the UI VM with the new IP address for the Analytics VM (step 7.2 above).

    1. Thanks for this Brandon.

      On the portgroup point, this would explain why a test failover using the auto/bubble network does not work, and why you need the UI and Analytics VM on the same portgroup/network even for a test failover. We saw this in testing.

      On the repair point, when we used DNS across the board (via KB 2017835, and registering the Analytics VM with its FQDN rather than IP), we didn’t need to do any repair. We even saw this with the IP customization method as opposed to the vApp container method. You would still need it if IP addresses are used.

      Thanks again.

  2. A few questions as my workplace is looking at deploying a large Horizon View instance and using vCOps for Horizon for regular troubleshooting and performance reporting, along with SRM for recovery:

    1) Why is it that SRM/vR still doesn’t understand the concept of a vApp? vApps have been around for several years (5?) if I remember correctly.

    2) Why is it necessary to redeploy a vApp for the simple purpose of recreating the vApp container/properties? That seems rather unintuitive. In my mind, the vSphere Client should have the ability to create a vApp from scratch the same as you can create a normal VM from scratch.

    Finally, a couple statements. Thank you for your post. I sincerely hope that VMware takes note of it.

    1. Very valid points. We understand the need for vApp support too in DR scenarios. And this is understood by the product management team.
