
vSphere with Tanzu stateful application backup/restore using Velero vSphere Operator

Recently I wrote about our new Velero vSphere Operator. This new functionality, launched with VMware Cloud Foundation (VCF) 4.2, enables administrators to back up and restore objects in their vSphere with Tanzu namespaces. In my previous post, I showed how we could use the Velero vSphere Operator to back up and restore a stateless application (the example used was an Nginx deployment) to and from an S3 object store bucket. The S3 object store and bucket were provided by the MinIO Operator, which is also available in VCF 4.2 as part of the vSAN Data Persistence platform (DPp) offering. In this post, I will demonstrate how the Velero vSphere Operator can be used to back up and restore a stateful application in a vSphere with Tanzu namespace. I will use a 3-node Cassandra NoSQL database that uses PodVMs for compute. Note, however, that there is an additional setup step required before we can do this stateful backup/restore operation: the deployment of the Velero Data Manager onto your vSphere environment. Let’s look at that next.

Velero Data Manager

Backups and restores of persistent volumes backed by vSphere storage can now use VADP, the vSphere APIs for Data Protection, which means that the Velero vSphere Operator can use vSphere snapshots as part of backup and restore operations. The Velero Data Manager is required to communicate with vCenter Server for snapshot operations, as well as to act as the data mover that moves snapshot data to and from the S3 object store. More details about the Data Manager can be found here.

Network Considerations

The Velero Data Manager is packaged as an OVA and should be deployed on your vSphere environment where vSphere with Tanzu is running. There are some networking considerations when deploying the Velero Data Manager. These can be summarized as follows:

  • A dedicated backup network should ideally be created in the cluster.
  • All ESXi hosts in the cluster should have a VMkernel interface connected to the backup network, and should be tagged with the vSphere Backup NFC service.
  • The Velero Data Manager should be deployed on the backup network.
  • The backup network should have a route to the vSphere Management network to allow the Velero Data Manager to communicate with vCenter Server.
  • The backup network should have a route to the vSphere with Tanzu Load Balancer network to allow the Velero Data Manager to communicate with the vSphere with Tanzu Control Plane.
  • The Velero Data Manager should have an external route so that it can pull from Docker Hub. Alternatively, if you have an internal container image registry (e.g. Harbor), this connectivity is not required.
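The VMkernel tagging step above can be scripted per host with esxcli. The following is a minimal sketch: the host names and the vmk interface (vmk3) are assumptions for illustration, and the function only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: generate the esxcli commands that tag a backup-network VMkernel
# NIC on each ESXi host with the vSphere Backup NFC service.
# The host names and "vmk3" are assumptions -- substitute your own.
tag_backup_nfc_cmds() {
  vmk="$1"; shift
  for host in "$@"; do
    # Each ESXi host in the cluster needs its backup-network vmk tagged
    printf 'ssh root@%s esxcli network ip interface tag add -i %s -t vSphereBackupNFC\n' "$host" "$vmk"
  done
}

# Print the commands for review before running them against the hosts
tag_backup_nfc_cmds vmk3 esxi-01.example.com esxi-02.example.com
```

Printing the commands first is a deliberate dry-run step; once reviewed, they can be piped to a shell or run by hand against each host.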

As you can see, there are a considerable number of network requirements for the Velero Data Manager. I thought it might be a little easier to visualize it as follows:

Configuration Parameters

Once the Velero Data Manager has been deployed, but before you power it on, a number of Advanced Configuration Settings need to be put in place. In the vSphere Client, navigate to the Velero Data Manager virtual machine, select Edit Settings, then choose the VM Options tab. Scroll down to Advanced and, in that section, click on Edit Configuration Parameters. The following parameters will need to be updated:

guestinfo.cnsdp.wcpControlPlaneIP (default: unset, vSphere with Tanzu control plane IP address)
guestinfo.cnsdp.vcAddress (default: unset)
guestinfo.cnsdp.vcUser (default: unset)
guestinfo.cnsdp.vcPassword (default: unset)
guestinfo.cnsdp.vcPort (default: 443)
guestinfo.cnsdp.veleroNamespace (default: velero)
guestinfo.cnsdp.datamgrImage (default: unset, use image vsphereveleroplugin/data-manager-for-plugin:1.1.0)
guestinfo.cnsdp.updateKubectl (default: false; gets kubectl from the master on every VM restart)
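If you prefer the CLI to clicking through Edit Settings, guestinfo parameters can also be set with govc. Below is a minimal sketch, printed as a dry run; the VM name, addresses and credentials are assumptions for illustration, not values from this environment.

```shell
#!/bin/sh
# Sketch: compose a govc vm.change invocation that sets the Velero Data
# Manager guestinfo parameters. All values below are illustrative
# assumptions -- substitute your own.
build_guestinfo_args() {
  # Emit one "-e key=value" pair per line, consumed by govc vm.change
  printf '%s\n' "-e guestinfo.cnsdp.wcpControlPlaneIP=$1"
  printf '%s\n' "-e guestinfo.cnsdp.vcAddress=$2"
  printf '%s\n' "-e guestinfo.cnsdp.vcUser=$3"
  printf '%s\n' "-e guestinfo.cnsdp.veleroNamespace=$4"
}

ARGS=$(build_guestinfo_args 192.168.100.10 vcsa.example.com administrator@vsphere.local velero-ns)
# Printed as a dry run; remove the leading "echo" (and export GOVC_URL,
# GOVC_USERNAME and GOVC_PASSWORD) to actually apply the settings.
echo govc vm.change -vm velero-data-manager $ARGS
```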

Note that if the Velero installer has been set to a different namespace in the vSphere with Tanzu environment, you will need to change this advanced setting as well. This is not the namespace with the Velero Operator, but rather the namespace where the Velero server was installed (in my previous example, this was velero-ns). This is what it should look like in the vSphere UI:

With these changes in place, the Velero Data Manager may now be powered on.

Checking Velero Data Manager operations

Once the Velero Data Manager VM is powered on, you can log in to it as root. On first login, the root password defaults to changeme, and you will be prompted to change it. Once you are logged in, you can check the status of the advanced parameters as follows:

root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.datamgrImage'

root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.wcpControlPlaneIP'

root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.vcAddress'

root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.vcUser'

root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.vcPassword'

root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.vcPort'

root@photon-cnsdp [ ~ ]# vmtoolsd --cmd 'info-get guestinfo.cnsdp.veleroNamespace'

Note that the vCenter username and password appear as unset as a security precaution. Another item that can be checked is the status of the data manager service, as follows:

root@photon-cnsdp [ ~ ]# systemctl status velero-datamgr.service
● velero-datamgr.service - Start Velero vsphere plugin data manager
   Loaded: loaded (/lib/systemd/system/velero-datamgr.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Tue 2021-03-02 09:00:02 UTC; 6min ago
  Process: 3084 ExecStart=/usr/bin/ (code=exited, status=0/SUCCESS)
 Main PID: 3084 (code=exited, status=0/SUCCESS)

Mar 02 08:59:55 photon-cnsdp[3084]: [1B blob data]
Mar 02 08:59:55 photon-cnsdp[3084]: To change context, use `kubectl config use-context <workload name>`
Mar 02 08:59:55 photon-cnsdp[3084]: Switched to context "vi-user".
Mar 02 09:00:01 photon-cnsdp[3084]: 1.1.0: Pulling from cormachogan/data-manager-for-plugin
Mar 02 09:00:01 photon-cnsdp[3084]: Digest: sha256:1b0ff07325aa2023bc1d915a6c016205509eee3f856b56d3c9e262ddf6d59545
Mar 02 09:00:01 photon-cnsdp[3084]: Status: Image is up to date for
Mar 02 09:00:01 photon-cnsdp[3084]: Deleted Containers:
Mar 02 09:00:01 photon-cnsdp[3084]: 3cbdcabb27c664455a359afa881006816f8cee3717540ae4bd3efec954dd070b
Mar 02 09:00:01 photon-cnsdp[3084]: Total reclaimed space: 4.143kB
Mar 02 09:00:01 photon-cnsdp[3084]: d1c3b7e29ca863c731c1af8e10829745d33187b277039a4170f0df1a9d4d2e29

Note that the velero-datamgr.service is not a daemon-like service; it exits once it completes its execution. The “Active: inactive (dead)” status is therefore a little misleading. As long as you see “code=exited, status=0/SUCCESS”, the execution completed successfully.

The more observant of you may notice the container being pulled from my own personal registry. You should not need to do this, so long as you can successfully pull from Docker Hub and are not impacted by the pull rate limiting that Docker introduced recently. If you are impacted, you might want to pull the image vsphereveleroplugin/data-manager-for-plugin:1.1.0 from Docker Hub and push it to your own registry. You would also need to do this for air-gapped environments.
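If you do need to mirror the image into your own registry, the workflow is simply pull, retag and push. The following is a minimal sketch, assuming a Harbor registry at harbor.example.com (a hypothetical address); the guestinfo.cnsdp.datamgrImage advanced setting would then be pointed at the retagged name.

```shell
#!/bin/sh
# Sketch: mirror the data manager image into a private registry to avoid
# Docker Hub rate limits, or for air-gapped environments.
# "harbor.example.com/library" is an assumed registry/project.
SRC="vsphereveleroplugin/data-manager-for-plugin:1.1.0"
REG="harbor.example.com/library"

# Derive the retagged image name: keep repo name and tag, swap in the
# private registry prefix
retag() {
  printf '%s/%s\n' "$2" "${1##*/}"
}

DST=$(retag "$SRC" "$REG")
# Printed as a dry run; pipe to sh (with registry credentials in place)
# to actually mirror the image.
echo "docker pull $SRC"
echo "docker tag $SRC $DST"
echo "docker push $DST"
```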

Last but not least, use the docker ps command to verify that the container image is running on the Velero Data Manager:

root@photon-cnsdp [ ~ ]# docker ps
CONTAINER ID        IMAGE                                               COMMAND                  CREATED             STATUS              PORTS               NAMES
754679457e7a   "/datamgr server --u…"   8 seconds ago       Up 6 seconds                            velero-datamgr

Again, ignore the fact that the image has been pulled from my personal registry on this occasion. As long as it is running, no matter where it was pulled from, you should be good to go. OK – time to take a backup and then do a restore, and verify that it is all working correctly.

Take a backup

As mentioned in the introduction, I already deployed the Velero vSphere Operator, along with the Velero Server and Client pieces in my previous post. I’m using the exact same environment for this backup test, but this time I plan to take a backup of a Cassandra NoSQL database, delete it, and then restore it. Let’s do that now:

The Cassandra StatefulSet

I am using manifests from my 101 tutorials, which are located on GitHub here. Make sure you use the StatefulSet that has the storageClassName specified in the spec, not in the annotations. Velero does not seem to like the latter, and the annotation form is no longer a commonly used syntax.

  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: vsan-default-storage-policy
      resources:
        requests:
          storage: 1Gi

Velero backup using app labels

This is what I currently have deployed in my cormac-ns namespace on vSphere with Tanzu. You can see the Cassandra Pods, Service and StatefulSet are all labeled with app=cassandra.

$ kubectl get all --show-labels
NAME                        READY   STATUS    RESTARTS   AGE   LABELS
pod/cassandra-0             1/1     Running   0          17m   app=cassandra,controller-revision-hash=cassandra-7c48cc4465,
pod/cassandra-1             1/1     Running   0          16m   app=cassandra,controller-revision-hash=cassandra-7c48cc4465,
pod/cassandra-2             1/1     Running   0          15m   app=cassandra,controller-revision-hash=cassandra-7c48cc4465,

NAME                        TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE   LABELS
service/cassandra           ClusterIP   None         <none>        9042/TCP   17m   app=cassandra

NAME                         READY   AGE   LABELS
statefulset.apps/cassandra   3/3     17m   app=cassandra

Initiate the backup

The following velero command will back up everything in the namespace with the label app=cassandra.

$ velero backup create cassandra-snap-backup-vdm  --selector app=cassandra --snapshot-volumes                            
Backup request "cassandra-snap-backup-vdm" submitted successfully.
Run `velero backup describe cassandra-snap-backup-vdm` or `velero backup logs cassandra-snap-backup-vdm` for more details.

$ velero backup describe cassandra-snap-backup-vdm
Name:         cassandra-snap-backup-vdm
Namespace:    velero-ns

Phase:  InProgress

Errors:    0
Warnings:  0

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  app=cassandra

Storage Location:  default

Velero-Native Snapshot PVs:  true

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2021-03-03 08:25:27 +0000 GMT
Completed:  <n/a>

Expiration:  2021-04-02 09:25:27 +0100 IST

Velero-Native Snapshots: <none included>

The backup is currently in progress. While this is taking place, each of the three PVCs will need to be snapshotted, and the snapshot data will then need to be uploaded to the S3 object store. Special Custom Resources (CRs) have been implemented in Velero which enable us to monitor this progress.

Monitoring snapshot and upload progress

As mentioned, there will be a snapshot and an upload for each PVC that is part of the Cassandra StatefulSet. Note that the snapshots are queried from the namespace where the backup was taken (cormac-ns), but the uploads are queried from the Velero namespace (velero-ns). We can examine them as follows:

$ kubectl get snapshots -n cormac-ns
NAME                                        AGE
snap-77e023fb-b652-43d8-88d7-3bb2395d1e60   16s
snap-d9d4cdd7-4885-4dff-b041-ff745340d7f1   18s
snap-dd6a214d-0a04-444a-85f6-1dc69a660901   17s

$ kubectl get uploads -n velero-ns
NAME                                          AGE
upload-18a7df08-ba11-479a-b137-0cbff7c62d4e   29s
upload-4a17c0af-fe19-446b-a9af-b8ac734071a1   28s
upload-91919acc-e858-4254-9c52-a93a32fad5fa   30s

$ kubectl describe snapshot snap-dd6a214d-0a04-444a-85f6-1dc69a660901 -n cormac-ns
Name:         snap-dd6a214d-0a04-444a-85f6-1dc69a660901
Namespace:    cormac-ns
Annotations:  <none>
API Version:
Kind:         Snapshot
  Creation Timestamp:  2021-03-03T08:25:43Z
  Generation:          1
  Backup Repository:  br-afcc2134-014d-45c0-b70f-7ab46f88af16
  Resource Handle:
    API Group:
    Kind:       PersistentVolumeClaim
    Name:       cassandra-data-cassandra-1
  Completion Timestamp:  2021-03-03T08:26:27Z
  Metadata:              CpkJChpjYXNzYW5kcmEtZGF0YS1jYXNzYW5kcmEtMRIAGgljb3JtYWMtbnMiTi9hcGkvdjEvbmFtZXNwYWNlcy9jb3JtYWMtbnMvcGVyc2lzdGVudHZvbHVtZWNsYWltcy9jYXNzYW5kcmEtZGF0YS1jYXNzYW5kcmEtMSokNzMyNjJjMzYtMzIzZC00Y2NlLWE2M2QtNmU0ZjZkOThjOGY4MgkxMTMwMDg0MjI4AEIICMvk+4EGEABaEAoDYXBwEgljYXNzYW5kcmFiJgofcHYua3ViZXJuZXRlcy5pby9iaW5kLWNvbXBsZXRlZBIDeWVzYisKJHB2Lmt1YmVybmV0ZXMuaW8vYm91bmQtYnktY29udHJvbGxlchIDeWVzYkcKLXZvbHVtZS5iZXRhLmt1YmVybmV0ZXMuaW8vc3RvcmFnZS1wcm92aXNpb25lchIWY3NpLnZzcGhlcmUudm13YXJlLmNvbWI3Cil2b2x1bWVoZWFsdGguc3RvcmFnZS5rdWJlcm5ldGVzLmlvL2hlYWx0aBIKYWNjZXNzaWJsZXIca3ViZXJuZXRlcy5pby9wdmMtcHJvdGVjdGlvbnoAigHoAQoUYmFja3VwLWRyaXZlci1zZXJ2ZXISBlVwZGF0ZRoCdjEiCAjL5PuBBhAAMghGaWVsZHNWMTqvAQqsAXsiZjptZXRhZGF0YSI6eyJmOmxhYmVscyI6eyIuIjp7fSwiZjphcHAiOnt9fX0sImY6c3BlYyI6eyJmOmFjY2Vzc01vZGVzIjp7fSwiZjpyZXNvdXJjZXMiOnsiZjpyZXF1ZXN0cyI6eyIuIjp7fSwiZjpzdG9yYWdlIjp7fX19LCJmOnN0b3JhZ2VDbGFzc05hbWUiOnt9LCJmOnZvbHVtZU1vZGUiOnt9fX2KAd8CChdrdWJlLWNvbnRyb2xsZXItbWFuYWdlchIGVXBkYXRlGgJ2MSIICMzk+4EGEAAyCEZpZWxkc1YxOqMCCqACeyJmOm1ldGFkYXRhIjp7ImY6YW5ub3RhdGlvbnMiOnsiLiI6e30sImY6cHYua3ViZXJuZXRlcy5pby9iaW5kLWNvbXBsZXRlZCI6e30sImY6cHYua3ViZXJuZXRlcy5pby9ib3VuZC1ieS1jb250cm9sbGVyIjp7fSwiZjp2b2x1bWUuYmV0YS5rdWJlcm5ldGVzLmlvL3N0b3JhZ2UtcHJvdmlzaW9uZXIiOnt9fX0sImY6c3BlYyI6eyJmOnZvbHVtZU5hbWUiOnt9fSwiZjpzdGF0dXMiOnsiZjphY2Nlc3NNb2RlcyI6e30sImY6Y2FwYWNpdHkiOnsiLiI6e30sImY6c3RvcmFnZSI6e319LCJmOnBoYXNlIjp7fX19igGHAQoOdnNwaGVyZS1zeW5jZXISBlVwZGF0ZRoCdjEiCAjg5vuBBhAAMghGaWVsZHNWMTpVClN7ImY6bWV0YWRhdGEiOnsiZjphbm5vdGF0aW9ucyI6eyJmOnZvbHVtZWhlYWx0aC5zdG9yYWdlLmt1YmVybmV0ZXMuaW8vaGVhbHRoIjp7fX19fRJ2Cg1SZWFkV3JpdGVPbmNlEhISEAoHc3RvcmFnZRIFCgMxR2kaKHB2Yy03MzI2MmMzNi0zMjNkLTRjY2UtYTYzZC02ZTRmNmQ5OGM4ZjgqG3ZzYW4tZGVmYXVsdC1zdG9yYWdlLXBvbGljeTIKRmlsZXN5c3RlbRooCgVCb3VuZBINUmVhZFdyaXRlT25jZRoQCgdzdG9yYWdlEgUKAzFHaQ==
  Phase:                 Uploaded
  Snapshot ID:        pvc:cormac-ns/cassandra-data-cassandra-1:aXZkOjcxZWEwOTIxLTNhY2EtNDIyMi1hNTc1LTNiYWNiNmQyZTFlZjoxOGE3ZGYwOC1iYTExLTQ3OWEtYjEzNy0wY2JmZjdjNjJkNGU
  Svc Snapshot Name:
Events:               <none>

$ kubectl describe upload upload-18a7df08-ba11-479a-b137-0cbff7c62d4e -n velero-ns
Name:         upload-18a7df08-ba11-479a-b137-0cbff7c62d4e
Namespace:    velero-ns
Annotations:  <none>
API Version:
Kind:         Upload
  Creation Timestamp:  2021-03-03T08:25:44Z
  Generation:          3
  Backup Repository:   br-afcc2134-014d-45c0-b70f-7ab46f88af16
  Backup Timestamp:    2021-03-03T08:25:44Z
  Snapshot ID:         ivd:71ea0921-3aca-4222-a575-3bacb6d2e1ef:18a7df08-ba11-479a-b137-0cbff7c62d4e
  Snapshot Reference:  cormac-ns/snap-dd6a214d-0a04-444a-85f6-1dc69a660901
  Completion Timestamp:  2021-03-03T08:26:27Z
  Message:               Upload completed
  Next Retry Timestamp:  2021-03-03T08:25:44Z
  Phase:                 Completed
  Processing Node:
  Start Timestamp:  2021-03-03T08:25:44Z
Events:             <none>
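Rather than repeatedly describing each Upload by hand, you can poll the Upload phases until they all report Completed. Below is a minimal sketch; the jsonpath query and the polling loop are my own additions, not part of the Velero tooling.

```shell
#!/bin/sh
# Sketch: decide whether every Upload CR has reached the "Completed"
# phase, given a whitespace-separated list of phases such as the output
# of:
#   kubectl get uploads -n velero-ns -o jsonpath='{.items[*].status.phase}'
all_uploads_done() {
  for phase in $1; do
    # Any phase other than Completed (e.g. New, InProgress) means keep waiting
    [ "$phase" = "Completed" ] || return 1
  done
  return 0
}

# Example polling loop (needs cluster access, so commented out here):
# until all_uploads_done "$(kubectl get uploads -n velero-ns -o jsonpath='{.items[*].status.phase}')"; do
#   sleep 10
# done
all_uploads_done "Completed Completed Completed" && echo "all uploads completed"
```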

Once all backup snapshots have been uploaded to the S3 object store, the backup job should report completed.

$ velero backup get
NAME                        STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
cassandra-snap-backup-vdm   Completed   0        4          2021-03-03 08:25:27 +0000 GMT   29d       default            app=cassandra

Do a Restore

With everything backed up, I can now try a restore operation. For the purposes of this exercise, I deleted my Cassandra statefulset from the cormac-ns namespace, and I also removed the PVCs – Persistent Volume Claims. I can now initiate a restore operation using the backup that I took in the previous step. Here is the command:

$ velero restore create cassandra-snap-restore-vdm --from-backup cassandra-snap-backup-vdm
Restore request "cassandra-snap-restore-vdm" submitted successfully.
Run `velero restore describe cassandra-snap-restore-vdm` or `velero restore logs cassandra-snap-restore-vdm` for more details.

$ velero restore describe cassandra-snap-restore-vdm
Name:         cassandra-snap-restore-vdm
Namespace:    velero-ns
Labels:       <none>
Annotations:  <none>

Phase:  InProgress

Started:    2021-03-03 08:38:48 +0000 GMT
Completed:  <n/a>

Backup:  cassandra-snap-backup-vdm

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Now, similar to how the backup worked, a restore must download the snapshot data from the S3 object store in order to restore the PVCs. Again, there is a special Velero Custom Resource that allows us to monitor this.

Monitoring download progress

The following command allows us to examine the download of the snapshot data from the backup destination (the S3 object store). Note that whereas the snapshot and upload operations seem to occur in parallel, the downloads appear to happen sequentially, one after another.

$ kubectl get downloads -n velero-ns
NAME                                                                                 AGE
download-91919acc-e858-4254-9c52-a93a32fad5fa-097ea200-3055-46ee-a0d1-57b055df99a4   31s

$ kubectl describe download download-91919acc-e858-4254-9c52-a93a32fad5fa-097ea200-3055-46ee-a0d1-57b055df99a4 -n velero-ns
Name:         download-91919acc-e858-4254-9c52-a93a32fad5fa-097ea200-3055-46ee-a0d1-57b055df99a4
Namespace:    velero-ns
Annotations:  <none>
API Version:
Kind:         Download
  Creation Timestamp:  2021-03-03T08:38:49Z
  Generation:          1
  Backup Repository Name:        br-afcc2134-014d-45c0-b70f-7ab46f88af16
  Clonefrom Snapshot Reference:  cormac-ns/641a618b-5466-4ec9-af83-88934ad01063
  Protected Entity ID:           ivd:fae8c416-2a08-4cdc-8b91-c1acdc562306
  Restore Timestamp:             2021-03-03T08:38:49Z
  Snapshot ID:                   ivd:fae8c416-2a08-4cdc-8b91-c1acdc562306:91919acc-e858-4254-9c52-a93a32fad5fa
  Next Retry Timestamp:  2021-03-03T08:38:49Z
  Phase:                 New
Events:  <none>

Eventually all three downloads will appear (one for each PVC) and, all going well, complete.

Checking the Restore

Once the Cassandra statefulset has been restored, you can check it as follows.

$ velero get restores
NAME                         BACKUP                      STATUS      STARTED                         COMPLETED                       ERRORS   WARNINGS   CREATED                         SELECTOR
cassandra-snap-restore-vdm   cassandra-snap-backup-vdm   Completed   2021-03-03 08:38:48 +0000 GMT   2021-03-03 09:00:39 +0000 GMT   0        0          2021-03-03 08:38:48 +0000 GMT   <none>

$ kubectl get all
NAME                         READY   STATUS    RESTARTS   AGE
pod/cassandra-0              1/1     Running   0          3m44s
pod/cassandra-1              1/1     Running   2          3m44s
pod/cassandra-2              1/1     Running   0          2m16s

NAME                         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/cassandra            ClusterIP   None         <none>        9042/TCP   3m44s

NAME                         READY   AGE
statefulset.apps/cassandra   3/3     3m44s

$ kubectl get pvc
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
cassandra-data-cassandra-0    Bound    pvc-077523ec-06d5-4279-a890-2f4a2fb3f831   1Gi        RWO            vsan-default-storage-policy   38m
cassandra-data-cassandra-1    Bound    pvc-5ac8ae52-dc86-44be-9ba7-9f190fd2f60b   1Gi        RWO            vsan-default-storage-policy   32m
cassandra-data-cassandra-2    Bound    pvc-e929d81b-6549-4bd4-9e16-bc9842370f83   1Gi        RWO            vsan-default-storage-policy   22m
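Beyond confirming that the Kubernetes objects are back, it is worth verifying the Cassandra ring itself. nodetool status, run inside one of the pods, reports healthy nodes as UN (Up/Normal). Below is a minimal sketch that counts them; the sample output is illustrative, standing in for a live kubectl exec.

```shell
#!/bin/sh
# Sketch: count the ring members reported Up/Normal ("UN") in
# "nodetool status" output. For a fully restored 3-node cluster we
# expect a count of 3.
count_up_normal() {
  printf '%s\n' "$1" | grep -c '^UN '
}

# In a live cluster you would feed it real output:
#   count_up_normal "$(kubectl exec cassandra-0 -- nodetool status)"
# Illustrative sample output (addresses and IDs are made up):
SAMPLE="Datacenter: DC1-K8Demo
UN  10.244.1.5  104.55 KiB  32  100.0%  host-id-1  Rack1-K8Demo
UN  10.244.2.6  93.02 KiB   32  100.0%  host-id-2  Rack1-K8Demo
UN  10.244.3.7  85.10 KiB   32  100.0%  host-id-3  Rack1-K8Demo"
echo "nodes up/normal: $(count_up_normal "$SAMPLE")"
```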

There you have it. A stateful application running in a vSphere with Tanzu namespace, backed up and restored using the Velero vSphere Operator and the Velero Data Manager appliance. Further details can be found in the official documentation on GitHub.
