Recently I wrote about our new Velero vSphere Operator. This new functionality, launched with VMware Cloud Foundation (VCF) 4.2, enables administrators to back up and restore objects in their vSphere with Tanzu namespaces. In my previous post, I showed how we could use the Velero vSphere Operator to back up and restore a stateless application (the example used was an Nginx deployment) to and from an S3 Object Store bucket. The S3 object store and bucket were provided by the Minio Operator that is also available in VCF 4.2 as part of the vSAN Data Persistence platform (DPp) offering. In this post, I will demonstrate how the Velero vSphere Operator can be used to back up and restore a stateful application in a vSphere with Tanzu namespace. I will use a 3-node Cassandra NoSQL database that uses PodVMs for compute. Note, however, that there is an additional setup step required before we can do this stateful backup/restore operation: the deployment of the Velero Data Manager onto your vSphere environment. Let’s look at that next.
Velero Data Manager
Backups and restores of persistent volumes, backed by vSphere storage, can now use VADP, the vSphere APIs for Data Protection. This means that the Velero vSphere Operator can use vSphere snapshots as part of the backup and restore operations. The Velero Data Manager is required to communicate with vCenter Server for snapshot operations, as well as to act as the data mover that moves the snapshot data to and from the S3 Object Store. More details about the Data Manager can be found here.
The Velero Data Manager is packaged as an OVA and should be deployed on your vSphere environment where vSphere with Tanzu is running. There are some networking considerations when deploying the Velero Data Manager. These can be summarized as follows:
- A dedicated backup network should ideally be created in the cluster.
- All ESXi hosts in the cluster should have a VMkernel interface connected to the backup network, and should be tagged with the vSphere Backup NFC service.
- The Velero Data Manager should be deployed on the backup network.
- The backup network should have a route to the vSphere Management network to allow the Velero Data Manager to communicate with vCenter Server.
- The backup network should have a route to the vSphere with Tanzu Load Balancer network to allow the Velero Data Manager to communicate with the vSphere with Tanzu Control Plane.
- The Velero Data Manager should have an external route so that it can pull from Docker Hub. Alternatively, if you have an internal container image registry (e.g. Harbor), this connectivity is not required.
As you can see, there are a considerable number of network requirements for the Velero Data Manager. I thought it might be a little easier to visualize it as follows:
Once the Velero Data Manager has been deployed, but before you power it on, a number of Advanced Configuration Settings need to be put in place. In the vSphere Client, navigate to the Velero Data Manager Virtual Machine, select Edit Settings, then choose the VM Options tab. Scroll down to Advanced, and in that section, click Edit Configuration Parameters. The following parameters will need to be updated:
- guestinfo.cnsdp.wcpControlPlaneIP (default: unset, vSphere with Tanzu control plane IP address)
- guestinfo.cnsdp.vcAddress (default: unset)
- guestinfo.cnsdp.vcUser (default: unset)
- guestinfo.cnsdp.vcPassword (default: unset)
- guestinfo.cnsdp.vcPort (default: 443)
- guestinfo.cnsdp.veleroNamespace (default: velero)
- guestinfo.cnsdp.datamgrImage (default: unset, use image vsphereveleroplugin/data-manager-for-plugin:1.1.0)
- guestinfo.cnsdp.updateKubectl (default: false, gets from master on every VM restart)
Note that if the Velero installer has been set to a different namespace in the vSphere with Tanzu environment, you will need to change this advanced setting as well. This is not the namespace with the Velero Operator, but rather the namespace where the Velero server was installed (in my previous example, this was velero-ns). This is what it should look like in the vSphere UI:
With these changes in place, the Velero Data Manager may now be powered on.
Checking Velero Data Manager operations
Once the Velero Data Manager VM is powered on, you can log in to it as root. On first login, the root password defaults to changeme, and you will be prompted to change it. Once you are logged in, you can check the status of the advanced parameters as follows:
root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.datamgrImage'
unset
root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.wcpControlPlaneIP'
18.104.22.168
root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.vcAddress'
192.168.0.100
root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.vcUser'
unset
root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.vcPasswd'
unset
root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.vcPort'
443
root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.veleroNamespace'
velero-ns
Note that the vCenter username and password appear as unset as a security precaution. Another item that can be checked is the status of the data manager service, as follows:
root@photon-cnsdp [ ~ ]# systemctl status velero-datamgr.service
● velero-datamgr.service - Start Velero vsphere plugin data manager
   Loaded: loaded (/lib/systemd/system/velero-datamgr.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Tue 2021-03-02 09:00:02 UTC; 6min ago
     Docs: https://github.com/vmware-tanzu/velero-plugin-for-vsphere
  Process: 3084 ExecStart=/usr/bin/velero-vsphere-plugin-datamgr.sh (code=exited, status=0/SUCCESS)
 Main PID: 3084 (code=exited, status=0/SUCCESS)

Mar 02 08:59:55 photon-cnsdp velero-vsphere-plugin-datamgr.sh: [1B blob data]
Mar 02 08:59:55 photon-cnsdp velero-vsphere-plugin-datamgr.sh: To change context, use `kubectl config use-context <workload name>`
Mar 02 08:59:55 photon-cnsdp velero-vsphere-plugin-datamgr.sh: Switched to context "vi-user".
Mar 02 09:00:01 photon-cnsdp velero-vsphere-plugin-datamgr.sh: 1.1.0: Pulling from cormachogan/data-manager-for-plugin
Mar 02 09:00:01 photon-cnsdp velero-vsphere-plugin-datamgr.sh: Digest: sha256:1b0ff07325aa2023bc1d915a6c016205509eee3f856b56d3c9e262ddf6d59545
Mar 02 09:00:01 photon-cnsdp velero-vsphere-plugin-datamgr.sh: Status: Image is up to date for quay.io/cormachogan/data-manager-for-plugin:1.1.0
Mar 02 09:00:01 photon-cnsdp velero-vsphere-plugin-datamgr.sh: Deleted Containers:
Mar 02 09:00:01 photon-cnsdp velero-vsphere-plugin-datamgr.sh: 3cbdcabb27c664455a359afa881006816f8cee3717540ae4bd3efec954dd070b
Mar 02 09:00:01 photon-cnsdp velero-vsphere-plugin-datamgr.sh: Total reclaimed space: 4.143kB
Mar 02 09:00:01 photon-cnsdp velero-vsphere-plugin-datamgr.sh: d1c3b7e29ca863c731c1af8e10829745d33187b277039a4170f0df1a9d4d2e29
Note that the velero-datamgr.service is not a daemon-like service, and that it exits once it completes its execution. The “Active: inactive (dead)” is a little misleading. As long as you see the status, “code=exited, status=0/SUCCESS”, it means the execution exited with success.
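Since the Active line is not a useful signal here, any automated check is better keyed on the exit status. The following is a minimal sketch, assuming the `systemctl status` text shown above is piped in on stdin. The function name `datamgr_exited_ok` is my own and is not something shipped with the appliance.

```shell
# Succeeds when systemctl status output (read from stdin) shows that the
# data manager start-up script exited with status 0.
# The function name is hypothetical, chosen for this illustration only.
datamgr_exited_ok() {
    grep -q 'Main PID: .*(code=exited, status=0/SUCCESS)'
}

# Possible use on the appliance:
#   systemctl status velero-datamgr.service | datamgr_exited_ok \
#       && echo "data manager start-up script completed"
```

The check deliberately ignores the "inactive (dead)" text, which is expected for a service that runs once and exits.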
The more observant of you may notice the container being pulled from my own personal registry. You should not need to do this so long as you can successfully pull from Docker Hub, and are not impacted by the Docker Hub pull rate limiting implemented recently. If you are impacted, you might want to pull the image vsphereveleroplugin/data-manager-for-plugin:1.1.0 from Docker Hub, and push it to your own personal registry. You would also need to do this for air-gapped environments.
Last but not least, use the docker ps command to verify that the container image is running on the Velero Data Manager:
root@photon-cnsdp [ ~ ]# docker ps
CONTAINER ID   IMAGE                                               COMMAND                  CREATED         STATUS         PORTS   NAMES
754679457e7a   quay.io/cormachogan/data-manager-for-plugin:1.1.0   "/datamgr server --u…"   8 seconds ago   Up 6 seconds           velero-datamgr
Again, ignore the fact that the image has been pulled from my personal registry on this occasion. As long as it is running, no matter where it was pulled from, you should be good to go. OK – time to take a backup and then do a restore, and verify that it is all working correctly.
Take a backup
As mentioned in the introduction, I already deployed the Velero vSphere Operator, along with the Velero Server and Client pieces in my previous post. I’m using the exact same environment for this backup test, but this time I plan to take a backup of a Cassandra NoSQL database, delete it, and then restore it. Let’s do that now:
The Cassandra StatefulSet
I am using manifests from my 101 tutorials which are located on GitHub here. Make sure you use the statefulset that has the storageClassName specified in the spec, and not in the annotations. Velero does not seem to like the latter, and that is no longer a commonly used syntax.
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: vsan-default-storage-policy
      resources:
        requests:
          storage: 1Gi
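If you are unsure which form a manifest uses, a quick scan before deploying can help. This is a minimal sketch, assuming the annotation form in question is the legacy `volume.beta.kubernetes.io/storage-class` key. The function name `uses_legacy_storage_class` and the file name in the usage comment are hypothetical.

```shell
# Succeeds when the manifest read from stdin still sets the storage class
# through the legacy annotation rather than spec.storageClassName.
# Assumption: the legacy key is volume.beta.kubernetes.io/storage-class.
uses_legacy_storage_class() {
    grep -q 'volume\.beta\.kubernetes\.io/storage-class'
}

# Possible use (the file name is a placeholder):
#   uses_legacy_storage_class < cassandra-statefulset.yaml \
#       && echo "move the storage class into spec.storageClassName first"
```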
Velero backup using app labels
This is what I currently have deployed in my cormac-ns namespace on vSphere with Tanzu. You can see the Cassandra Pods, Service and StatefulSet are all labeled with app=cassandra.
$ kubectl get all --show-labels
NAME              READY   STATUS    RESTARTS   AGE   LABELS
pod/cassandra-0   1/1     Running   0          17m   app=cassandra,controller-revision-hash=cassandra-7c48cc4465,statefulset.kubernetes.io/pod-name=cassandra-0
pod/cassandra-1   1/1     Running   0          16m   app=cassandra,controller-revision-hash=cassandra-7c48cc4465,statefulset.kubernetes.io/pod-name=cassandra-1
pod/cassandra-2   1/1     Running   0          15m   app=cassandra,controller-revision-hash=cassandra-7c48cc4465,statefulset.kubernetes.io/pod-name=cassandra-2

NAME                TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE   LABELS
service/cassandra   ClusterIP   10.96.1.186   <none>        9042/TCP   17m   app=cassandra

NAME                         READY   AGE   LABELS
statefulset.apps/cassandra   3/3     17m   app=cassandra
Initiate the backup
The following velero command will back up everything in the namespace with the label app=cassandra.
$ velero backup create cassandra-snap-backup-vdm --selector app=cassandra --snapshot-volumes
Backup request "cassandra-snap-backup-vdm" submitted successfully.
Run `velero backup describe cassandra-snap-backup-vdm` or `velero backup logs cassandra-snap-backup-vdm` for more details.

$ velero backup describe cassandra-snap-backup-vdm
Name:         cassandra-snap-backup-vdm
Namespace:    velero-ns
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.18.2-6+38ac483e736488
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=18+

Phase:  InProgress

Errors:    0
Warnings:  0

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  app=cassandra

Storage Location:  default

Velero-Native Snapshot PVs:  true

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2021-03-03 08:25:27 +0000 GMT
Completed:  <n/a>

Expiration:  2021-04-02 09:25:27 +0100 IST

Velero-Native Snapshots: <none included>
The backup is currently in progress. Whilst this is taking place, a snapshot must be taken of each PVC (of which there are 3), and the snapshot data must then be uploaded to the S3 object store. Special Custom Resources (CRs) have been implemented in Velero which enable us to monitor this progress.
Monitoring snapshot and upload progress
As mentioned, there will be a snapshot and upload for each PVC that is part of the Cassandra statefulset. Note that the snapshots are queried from the namespace where the backup was taken (cormac-ns), but that the uploads are queried from the Velero namespace (velero-ns). We can examine them as follows:
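A minimal sketch of such a check follows. It assumes the plugin exposes these custom resources to kubectl as `snapshots` and `uploads` (by analogy with the `downloads` resource used during a restore), and that each object carries a `.status.phase` field as the Download object does. The helper name `all_in_phase` and the phase values `Uploaded` and `Completed` in the usage comments are my assumptions, so verify them against your own environment.

```shell
# Reads one phase per line on stdin. Succeeds only if there is at least
# one line and every line equals the expected phase passed as $1.
# Hypothetical helper, not part of Velero or the vSphere plugin.
all_in_phase() {
    awk -v want="$1" '
        $0 != want { bad = 1 }
        END { exit ((NR == 0 || bad) ? 1 : 0) }
    '
}

# Listing the custom resources (resource names are assumptions):
#   kubectl get snapshots -n cormac-ns
#   kubectl get uploads -n velero-ns
#
# Checking that every upload has finished (phase name is an assumption):
#   kubectl get uploads -n velero-ns \
#       -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' \
#       | all_in_phase Completed
```

Passing the expected phase as an argument keeps the helper usable for both resource types, since snapshots and uploads may report different terminal phases.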
Once all backup snapshots have been uploaded to the S3 object store, the backup job should report completed.
$ velero backup get
NAME                        STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
cassandra-snap-backup-vdm   Completed   0        4          2021-03-03 08:25:27 +0000 GMT   29d       default            app=cassandra
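If you want to script the wait for completion, the STATUS column can be pulled out of this table. Here is a minimal sketch, assuming the tabular output of `velero backup get` is piped in on stdin. The function name `backup_status` is hypothetical.

```shell
# Prints the STATUS column for the backup named in $1, given the table
# produced by `velero backup get` on stdin. Prints nothing if the backup
# is not listed. Hypothetical helper for illustration.
backup_status() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Possible use:
#   velero backup get | backup_status cassandra-snap-backup-vdm
```

This relies on the backup name and status being single whitespace-free tokens in the first two columns, which holds for the output shown above.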
Do a Restore
With everything backed up, I can now try a restore operation. For the purposes of this exercise, I deleted my Cassandra statefulset from the cormac-ns namespace, and I also removed the PVCs – Persistent Volume Claims. I can now initiate a restore operation using the backup that I took in the previous step. Here is the command:
$ velero restore create cassandra-snap-restore-vdm --from-backup cassandra-snap-backup-vdm
Restore request "cassandra-snap-restore-vdm" submitted successfully.
Run `velero restore describe cassandra-snap-restore-vdm` or `velero restore logs cassandra-snap-restore-vdm` for more details.

$ velero restore describe cassandra-snap-restore-vdm
Name:         cassandra-snap-restore-vdm
Namespace:    velero-ns
Labels:       <none>
Annotations:  <none>

Phase:  InProgress

Started:    2021-03-03 08:38:48 +0000 GMT
Completed:  <n/a>

Backup:  cassandra-snap-backup-vdm

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto
Now, similar to how a backup worked, a restore must download the snapshot data from the S3 object store in order to restore the PVCs. Again, there is a special Velero Custom Resource that allows us to monitor this.
Monitoring download progress
The following command will allow us to examine the download of the snapshot data from the backup destination (S3 object store). Note that whereas the snapshot and upload operations seem to occur in parallel, the downloads appear to happen sequentially, one after another.
$ kubectl get downloads -n velero-ns
NAME                                                                                 AGE
download-91919acc-e858-4254-9c52-a93a32fad5fa-097ea200-3055-46ee-a0d1-57b055df99a4   31s

$ kubectl describe download download-91919acc-e858-4254-9c52-a93a32fad5fa-097ea200-3055-46ee-a0d1-57b055df99a4 -n velero-ns
Name:         download-91919acc-e858-4254-9c52-a93a32fad5fa-097ea200-3055-46ee-a0d1-57b055df99a4
Namespace:    velero-ns
Labels:       velero.io/exclude-from-backup=true
Annotations:  <none>
API Version:  datamover.cnsdp.vmware.com/v1alpha1
Kind:         Download
Metadata:
  Creation Timestamp:  2021-03-03T08:38:49Z
  Generation:          1
.
--<snip>
.
Spec:
  Backup Repository Name:        br-afcc2134-014d-45c0-b70f-7ab46f88af16
  Clonefrom Snapshot Reference:  cormac-ns/641a618b-5466-4ec9-af83-88934ad01063
  Protected Entity ID:           ivd:fae8c416-2a08-4cdc-8b91-c1acdc562306
  Restore Timestamp:             2021-03-03T08:38:49Z
  Snapshot ID:                   ivd:fae8c416-2a08-4cdc-8b91-c1acdc562306:91919acc-e858-4254-9c52-a93a32fad5fa
Status:
  Next Retry Timestamp:  2021-03-03T08:38:49Z
  Phase:                 New
  Progress:
Events:  <none>
Eventually all 3 downloads will appear (one for each PVC) and eventually complete, all going well.
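Because the downloads show up one at a time, it can be handy to poll rather than re-run the command by hand. The following is a rough sketch only. `retry_until` and `count_rows` are hypothetical helpers of my own, and the attempt count and delay in the usage comment are arbitrary values.

```shell
# Runs a command repeatedly until it succeeds or the attempts run out.
# $1 = maximum attempts, $2 = seconds to sleep between attempts,
# remaining arguments = the command to run.
retry_until() {
    _attempts=$1
    _delay=$2
    shift 2
    _i=0
    while [ "$_i" -lt "$_attempts" ]; do
        if "$@"; then
            return 0
        fi
        _i=$((_i + 1))
        sleep "$_delay"
    done
    return 1
}

# Counts the data rows (everything after the header line) of a kubectl
# table read from stdin.
count_rows() {
    awk 'NR > 1 { n++ } END { print n + 0 }'
}

# Possible use, waiting for one download per PVC to appear:
#   three_downloads() {
#       [ "$(kubectl get downloads -n velero-ns | count_rows)" -eq 3 ]
#   }
#   retry_until 60 30 three_downloads
```

Counting rows only tells you the downloads exist, not that they have finished, so the restore status is still the final word.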
Checking the Restore
Once the Cassandra statefulset has been restored, you can check it as follows.
$ velero get restores
NAME                         BACKUP                      STATUS      STARTED                         COMPLETED                       ERRORS   WARNINGS   CREATED                         SELECTOR
cassandra-snap-restore-vdm   cassandra-snap-backup-vdm   Completed   2021-03-03 08:38:48 +0000 GMT   2021-03-03 09:00:39 +0000 GMT   0        0          2021-03-03 08:38:48 +0000 GMT   <none>

$ kubectl get all
NAME              READY   STATUS    RESTARTS   AGE
pod/cassandra-0   1/1     Running   0          3m44s
pod/cassandra-1   1/1     Running   2          3m44s
pod/cassandra-2   1/1     Running   0          2m16s

NAME                TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
service/cassandra   ClusterIP   10.96.2.237   <none>        9042/TCP   3m44s

NAME                         READY   AGE
statefulset.apps/cassandra   3/3     3m44s

$ kubectl get pvc
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
cassandra-data-cassandra-0   Bound    pvc-077523ec-06d5-4279-a890-2f4a2fb3f831   1Gi        RWO            vsan-default-storage-policy   38m
cassandra-data-cassandra-1   Bound    pvc-5ac8ae52-dc86-44be-9ba7-9f190fd2f60b   1Gi        RWO            vsan-default-storage-policy   32m
cassandra-data-cassandra-2   Bound    pvc-e929d81b-6549-4bd4-9e16-bc9842370f83   1Gi        RWO            vsan-default-storage-policy   22m
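The same readiness check can be scripted rather than eyeballed. This is a minimal sketch, assuming the plain `kubectl get pods` table (NAME, READY, STATUS, RESTARTS, AGE columns) is piped in on stdin. The function name `all_pods_ready` is hypothetical. It only confirms that the Pods are up, not that the Cassandra data itself is intact.

```shell
# Succeeds when the `kubectl get pods` table on stdin lists at least one
# pod and every pod is Running with all of its containers ready
# (READY column of the form n/n). Hypothetical helper for illustration.
all_pods_ready() {
    awk 'NR > 1 {
             split($2, r, "/")
             if (r[1] != r[2] || $3 != "Running") bad = 1
             n++
         }
         END { exit ((n > 0 && !bad) ? 0 : 1) }'
}

# Possible use:
#   kubectl get pods -n cormac-ns -l app=cassandra | all_pods_ready \
#       && echo "all Cassandra pods are ready"
```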
There you have it. A stateful application running in a vSphere with Tanzu namespace, backed up and restored using the Velero vSphere Operator and the Velero Data Manager appliance. Further details can be found in the official documentation on GitHub.