Introducing VMware Cloud Disaster Recovery

Cormac

4 years ago

At VMworld 2019, I had the pleasure of presenting our business unit’s Spotlight session with our GM, John Gilmartin (you can watch the complete recording here). One of the topics that generated a lot of interest was a low-cost Disaster Recovery (DR) service. A lot has happened in the past year, but most notable was the acquisition of Datrium. Having merged the original goal of a low-cost DR as a Service (DRaaS) solution with the smarts acquired from Datrium, we are now almost ready to deliver a new VMware Cloud Disaster Recovery service to our customers.

A closer look at VMware Cloud Disaster Recovery

So how does it work? Let’s assume that you have a number of virtual machines in your on-premises data center that need protecting against some sort of disaster. In your production site, we install a DRaaS connector which is configured to connect to our VMware Cloud Disaster Recovery cloud-based service and start protecting your VMs. Data from the virtual machines is replicated to our cloud-based Scale-Out File System, which gives us a cloud-efficient storage target. This is where the replicated copies of the VMs’ data are stored. VMware Cloud DR is managed via a SaaS Orchestrator, which can be thought of as a SaaS Management Console for DR. Connecting the DRaaS connector in the on-premises vSphere environment to the SaaS Orchestrator allows you to do tasks such as register your vCenter Server, create protection groups of VMs, set up site pairings for mapping networks, folders, resource pools, datastores, etc., and then of course run the DR plans for these protection groups. These could be Test plans or actual Failover plans, so yes, you can also run Test plans with VMware Cloud DR.

The interface is expected to look something like the following, which should be very easy to use for vSphere administrators:

In the event of a disaster, there are two options available for the compute recovery. VMware Cloud DR can either (a) spin up a brand new SDDC cluster in VMware Cloud on AWS on-demand, or (b) maintain a very small “pilot-light” SDDC cluster for faster recovery. The choice here boils down to faster recovery vs. the cost of maintaining compute in the cloud. In either case, the replicated copies of the virtual machine disks that are maintained on the Scale-Out File System are initially presented via an “NFS live-mount” to the VMC on AWS SDDC hosts, and the VMs are powered on as soon as the SDDC is available and the NFS datastore is mounted to the hosts. At the same time, Storage vMotion starts to migrate the VM data to primary storage in the SDDC. When the data is fully migrated, the VMs run from these restored copies instead of the “NFS live-mount” copies for optimal performance.

It should be noted that Test and Failover plans do not need to use the latest copy of the data, although they do by default. Each replication takes a snapshot copy of the data, so you can actually go back in time and use an earlier snapshot. This can be extremely useful in the case of ransomware, when you might need to go back a number of iterations to a point that is prior to when the event occurred. Test plans also have the choice between remaining on the “live-mount NFS”, or you can also decide that the test plan should re-hydrate the data on primary storage.
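To make the "go back in time" idea concrete, here is a minimal Python sketch of how a recovery point might be chosen. It is not VMware Cloud DR code: the function name `pick_recovery_point`, the snapshot timestamps and the incident time are all made up for illustration. The only behaviour taken from the description above is that the latest snapshot is the default, and that an earlier one can be selected when the latest copies are suspect.

```python
from datetime import datetime
from typing import Optional, Sequence


def pick_recovery_point(
    snapshots: Sequence[datetime],
    incident: Optional[datetime] = None,
) -> Optional[datetime]:
    """Return the snapshot to recover from (hypothetical helper).

    With no incident time, the latest snapshot is used (the default).
    With an incident time, the latest snapshot taken strictly before
    the incident is used. Returns None if no snapshot qualifies.
    """
    candidates = [s for s in snapshots if incident is None or s < incident]
    return max(candidates) if candidates else None


# Assumed example data: four snapshots taken on the same day.
snaps = [datetime(2020, 10, 1, hour) for hour in (0, 4, 8, 12)]

latest = pick_recovery_point(snaps)
clean = pick_recovery_point(snaps, incident=datetime(2020, 10, 1, 9, 30))
```

Here `latest` is the 12:00 snapshot, while `clean` is the 08:00 snapshot, the last one taken before the assumed 09:30 ransomware event.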

Another key feature is the “delta-based Failback”. Once your production site is ready to start running workloads again, VMware Cloud DR has the ability to figure out when the last replication was taken from the on-premises VM before failover, and uses that to determine how much data needs to be restored. So there is no need to do a full restore from the cloud. Instead, VMware Cloud DR does a comparison of what is on both the production site and the cloud, and only syncs back the delta of changes that have occurred since the failover. A very nice feature.
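The general idea of a delta-based sync can be sketched in a few lines of Python. To be clear, this is only an illustration of the concept: I am assuming a comparison based on per-block fingerprints, which may not be how VMware Cloud DR actually does it, and the names `fingerprints`, `failback_delta` and `apply_delta`, along with the tiny block size, are hypothetical.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # unrealistically small, purely for illustration


def fingerprints(data: bytes) -> List[str]:
    """Fingerprint each fixed-size block of a disk image."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def failback_delta(on_prem: bytes, cloud: bytes) -> Dict[int, bytes]:
    """Return only the cloud blocks that differ from the on-premises copy."""
    local = fingerprints(on_prem)
    delta: Dict[int, bytes] = {}
    for idx, fp in enumerate(fingerprints(cloud)):
        if idx >= len(local) or local[idx] != fp:
            start = idx * BLOCK_SIZE
            delta[idx] = cloud[start:start + BLOCK_SIZE]
    return delta


def apply_delta(on_prem: bytes, delta: Dict[int, bytes], cloud_len: int) -> bytes:
    """Patch the on-premises copy with the changed blocks."""
    buf = bytearray(on_prem[:cloud_len].ljust(cloud_len, b"\0"))
    for idx, block in delta.items():
        start = idx * BLOCK_SIZE
        buf[start:start + len(block)] = block
    return bytes(buf)


# Assumed example: one block changed and one block added after failover.
on_prem_copy = b"AAAABBBBCCCC"
cloud_copy = b"AAAAXXXXCCCCDDDD"

delta = failback_delta(on_prem_copy, cloud_copy)
restored = apply_delta(on_prem_copy, delta, len(cloud_copy))
```

In this example only two of the four blocks are sent back, yet `restored` ends up identical to `cloud_copy`. That is the saving a delta-based failback is after when compared to a full restore.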

Here is a simple diagram to represent the above features and workflows:

Of course, once you have failed back into your production environment, you can then go ahead and decommission any SDDC clusters that were stood up in VMC on AWS for the duration of the failover. Hopefully this has given you an appreciation for the cloud economics available in this solution.

One additional feature which I think is really quite useful is the fact that there are continuous compliance checks run against the DR plans every 30 minutes. This means that the connectivity, networking and resource pool mappings, datastore availability, IP address availability and so on are all checked on a regular basis. This gives you confidence that both the on-premises and cloud (failover) environments have been validated before you even attempt a failover plan or even a test plan. A very nice feature. Below is a screenshot showing continuous compliance of a DR plan:
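Conceptually, a compliance check of this kind is a set of small validations run against a plan on a timer. The Python sketch below shows that shape only. The `DRPlan` fields, the two check functions and the sample names are invented for illustration and are not the product's data model or API; the 30 minute interval is the one mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

CHECK_INTERVAL_MINUTES = 30  # interval quoted in the text above


@dataclass
class DRPlan:
    """Hypothetical, heavily simplified view of a DR plan."""
    name: str
    network_map: Dict[str, str]        # source network -> target network
    required_datastores: List[str]
    available_datastores: List[str]


def check_network_mappings(plan: DRPlan) -> List[str]:
    """Flag source networks that have no target mapped."""
    return [
        f"network '{src}' has no target mapping"
        for src, dst in plan.network_map.items()
        if not dst
    ]


def check_datastores(plan: DRPlan) -> List[str]:
    """Flag datastores the plan needs that are not currently available."""
    return [
        f"datastore '{ds}' is unavailable"
        for ds in plan.required_datastores
        if ds not in plan.available_datastores
    ]


CHECKS: List[Callable[[DRPlan], List[str]]] = [
    check_network_mappings,
    check_datastores,
]


def run_compliance(plan: DRPlan) -> List[str]:
    """Run every check and collect the problems; an empty list means compliant."""
    problems: List[str] = []
    for check in CHECKS:
        problems.extend(check(plan))
    return problems


plan = DRPlan(
    name="tier1-failover",
    network_map={"prod-net": "sddc-net", "dmz-net": ""},
    required_datastores=["ds-recovery"],
    available_datastores=[],
)
report = run_compliance(plan)
```

A scheduler would call `run_compliance` every `CHECK_INTERVAL_MINUTES` and surface any non-empty report, so that mapping or availability problems show up long before a failover is attempted.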

What about VMware Site Recovery?

I guess you are now wondering how this compares to our existing DR-to-the-Cloud product, VMware Site Recovery (VSR). Well, yes, there are similarities. They can both be used on VMware environments to do DR orchestration, including failover and failback. Both offerings should be familiar to a VI-Admin, as both use vSphere constructs. But there are some significant differences as well. While VMware Cloud DR replicates to a low-cost, cloud-based Scale-Out File System, VSR replicates directly to primary storage in the SDDC on VMC on AWS. This means that you need to have an SDDC environment already stood up and available in VMC on AWS for VSR. There are advantages to this of course, such as a much faster Recovery Time Objective (RTO) since the failover capacity is already provisioned. There is no overhead in waiting for an SDDC to be stood up, or a “pilot-light” SDDC to be scaled out, either of which has a direct impact on RTO.

A second difference is that VMware Site Recovery will also have a much smaller Recovery Point Objective (RPO) since it can replicate the on-premises VM data every 30 minutes or less. With VMware Cloud DR, we are looking at an RPO in the region of 4 hours.
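To put those two RPO figures side by side, here is a small Python sketch that works out how much data would be lost for a given failure time. It assumes replications happen on a fixed schedule and complete instantly, which is a simplification, and the function name and timestamps are hypothetical. The two intervals are the 30 minutes and roughly 4 hours mentioned above.

```python
from datetime import datetime, timedelta


def last_recovery_point(
    start: datetime, interval: timedelta, failure: datetime
) -> datetime:
    """Most recent scheduled replication at or before the failure time."""
    completed = (failure - start) // interval  # whole intervals elapsed
    return start + completed * interval


# Assumed example: replication schedule starts at midnight,
# and the site fails at 11:45 the same day.
schedule_start = datetime(2020, 10, 1, 0, 0)
failure_time = datetime(2020, 10, 1, 11, 45)

loss_4h = failure_time - last_recovery_point(
    schedule_start, timedelta(hours=4), failure_time
)
loss_30m = failure_time - last_recovery_point(
    schedule_start, timedelta(minutes=30), failure_time
)
```

With these made-up times, the 4 hour schedule loses 3 hours 45 minutes of changes while the 30 minute schedule loses 15 minutes. In the worst case, the loss approaches the full replication interval.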

RTO and RPO are certainly two of the main factors to consider when comparing VMware Cloud DR to VSR. And then of course, cost is the other major factor that enters the equation. These are the key points that should be evaluated when looking at any DRaaS.

VMware Cloud Disaster Recovery at VMworld 2020

There were 3 presentations on VMware Cloud DR at VMworld 2020 which I already mentioned in the Top 15 sessions to watch. Here they are again. For anyone interested in VMware Cloud DR, definitely take the time out to watch these presentations.

[HCI2876] Don’t be left in the Cold: Protect vSphere with Datrium-based Warm DRaaS. This is the level 100 introductory DRaaS session and provides a business overview of DRaaS.
[HCI2865] Protect all your workloads with DRaaS from VMware. Learn all about the Datrium acquisition and how it is used to implement Disaster Recovery as a Service (DRaaS) for VMware Cloud on AWS. This should give a good level 200 overview of the solution.
[HCI2886] Deep Dive into DRaaS based on Datrium. In this breakout, we get into the nuts and bolts (level 300) of the new DRaaS solution.

The team would love to hear your feedback on VMware Cloud Disaster Recovery. Do you think you would use it? Are there reasons why you would not use it? Let us know.