Velero vSphere Operator backup/restore of TKG “guest” cluster objects in vSphere with Tanzu
Over the past week or so, I have posted a number of blogs on how to get started with the new Velero vSphere Operator. I showed how to deploy the Operator in the Supervisor Cluster of vSphere with Tanzu, and also how to install the Velero and Backupdriver components in the Supervisor. We then went on to take backups and do restores of both stateless (e.g. an Nginx Deployment) and stateful (e.g. a Cassandra StatefulSet) applications which were running as PodVMs in a Supervisor cluster. In the latter post, we saw how the new Velero Data Manager acted as the interface between Velero, vSphere, and the backup destination. It contains both snapshot and data mover capabilities to move the data to the Velero backup destination, in our case a MinIO S3 Object Store bucket. I want to complete this series of blog posts with one final example, which is to show how to back up and restore a stateful application from within a Tanzu Kubernetes Grid (TKG) “guest” cluster. This TKG cluster has been deployed by the TKG Service (TKGS) to a namespace on vSphere with Tanzu. Within that TKG cluster, I will run a Cassandra StatefulSet application, then back it up, delete it and restore it. My backup (and restore) destination is the same S3 object store bucket that I used when backing up objects in the Supervisor cluster.
Requirements
We have already implemented most of the requirements for backing up TKG objects. The destination S3 Object Storage bucket has been put in place. In my case, I used the new feature available in VMware Cloud Foundation version 4.2, the vSAN Data Persistence platform (DPp), to deploy a MinIO S3 Object Store and bucket to act as a backup destination. As mentioned, I have already deployed the Velero vSphere Operator on the vSphere with Tanzu Supervisor Cluster, as well as the Velero Data Manager. In fact, the only step that remains is to deploy the Velero and Backupdriver components in the TKG “guest” cluster. We did this previously for the Supervisor cluster, but now we must do it for the TKG guest cluster. Let’s do that next.
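One small note before the installation: unless I call it out otherwise, all of the remaining install, backup and restore commands are run from the context of the TKG guest cluster. For reference, this is roughly how I log in to that context using the vSphere plugin for kubectl. The Supervisor IP address, Supervisor namespace and TKG cluster name below are taken from my own environment, so substitute your own:

$ kubectl-vsphere login --server=https://20.0.0.1 \
    --vsphere-username administrator@vsphere.local \
    --tanzu-kubernetes-cluster-namespace cormac-ns \
    --tanzu-kubernetes-cluster-name tkg-cluster-vcf-w-tanzu \
    --insecure-skip-tls-verify

$ kubectl config use-context tkg-cluster-vcf-w-tanzu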
Installation
The installation of Velero and the Backup Driver is much the same as a standard deployment on vanilla Kubernetes, the only difference being that you must include the vSphere plugin. This is so Velero knows to use vSphere snapshots (rather than a restic file copy, for example). This blog site has many examples of deploying Velero, and there is even an example of how to deploy the vSphere plugin from last year here. Note that this install command is run from the context of the TKG guest cluster. For the purposes of keeping everything in one place, here is an example install command:
./velero install \
  --provider aws \
  --bucket backup-bucket \
  --secret-file ./velero-minio-credentials \
  --namespace velero \
  --plugins "velero/velero-plugin-for-aws:v1.1.0","vsphereveleroplugin/velero-plugin-for-vsphere:1.1.0" \
  --backup-location-config region=minio,s3ForcePathStyle="true",s3Url="http://20.0.0.5",publicUrl="http://20.0.0.5" \
  --snapshot-location-config region=minio
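The --secret-file parameter points to a credentials file containing the MinIO access key and secret key, in the standard S3/AWS credentials format. Mine looks something along these lines (the key values here are obviously placeholders):

$ cat ./velero-minio-credentials
[default]
aws_access_key_id = <MINIO-ACCESS-KEY>
aws_secret_access_key = <MINIO-SECRET-KEY>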
Many of these entries will be familiar to you from when we did the install in the Supervisor cluster. Most of the parameters relate to the MinIO S3 Object Storage and bucket, which we are using as the backup destination. There are 2 plugins listed: one to access the S3 object store (aws) and the other to utilize vSphere snapshots (vsphere). Once the installer has run, there should be 2 Deployment objects (as well as 2 ReplicaSets and 2 Pods) in the velero namespace in the TKG guest cluster. One is for the Velero server component, whilst the other is for the backup driver. Let’s take a closer look.
$ kubectl config get-contexts
CURRENT   NAME                       CLUSTER    AUTHINFO                                   NAMESPACE
          20.0.0.1                   20.0.0.1   wcp:20.0.0.1:administrator@vsphere.local
          cormac-ns                  20.0.0.1   wcp:20.0.0.1:administrator@vsphere.local   cormac-ns
          minio-domain-c8            20.0.0.1   wcp:20.0.0.1:administrator@vsphere.local   minio-domain-c8
*         tkg-cluster-vcf-w-tanzu    20.0.0.3   wcp:20.0.0.3:administrator@vsphere.local
          velero-ns                  20.0.0.1   wcp:20.0.0.1:administrator@vsphere.local   velero-ns
          velero-vsphere-domain-c8   20.0.0.1   wcp:20.0.0.1:administrator@vsphere.local   velero-vsphere-domain-c8

$ kubectl get ns
NAME                                 STATUS   AGE
cassandra                            Active   3d
cormac-guest-ns                      Active   40d
default                              Active   84d
kube-node-lease                      Active   84d
kube-public                          Active   84d
kube-system                          Active   84d
music-system                         Active   83d
velero                               Active   10m
velero-vsphere-plugin-backupdriver   Active   10m
vmware-system-auth                   Active   84d
vmware-system-cloud-provider         Active   84d
vmware-system-csi                    Active   84d

$ kubectl get all -n velero
NAME                                 READY   STATUS    RESTARTS   AGE
pod/backup-driver-5567f6ccfd-lzp76   1/1     Running   0          2d23h
pod/velero-8469fd4f65-gs4kv          1/1     Running   0          2d23h

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/backup-driver   1/1     1            1           2d23h
deployment.apps/velero          1/1     1            1           2d23h

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/backup-driver-5567f6ccfd   1         1         1       2d23h
replicaset.apps/velero-8469fd4f65          1         1         1       2d23h
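Before moving on, it is also worth a quick sanity check that Velero can reach the backup storage location defined at install time. Something like the following should list the default location pointing at the backup-bucket (the exact columns vary between Velero versions, so treat this output as indicative only):

$ velero backup-location get
NAME      PROVIDER   BUCKET/PREFIX   ACCESS MODE
default   aws        backup-bucket   ReadWrite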
Prep Cassandra for backup
Before kicking off the backup, let's verify that the Cassandra StatefulSet is healthy and that the demo database contains some data which we can check again after the restore.
$ kubectl exec -it -n cassandra pod/cassandra-2 -- nodetool status
Datacenter: DC1-TKG-Demo
========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load        Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.4  137.32 KiB  32      100.0%            15890386-5681-48ff-bba4-1d9218f23ed2  Rack1-TKG-Demo
UN  192.168.2.6  141.32 KiB  32      100.0%            43b4a8c5-9055-4382-8c74-a1d543277742  Rack1-TKG-Demo
UN  192.168.2.7  133.92 KiB  32      100.0%            9ce483b2-9343-4254-bf04-87d41fac39fa  Rack1-TKG-Demo

$ kubectl exec -it -n cassandra pod/cassandra-2 -- cqlsh
Connected to TKG-Demo at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.9 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh> use demodb;
cqlsh:demodb> select * from emp;

 emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+----------+----------+-----------+---------
    100 |     Cork |   Cormac |       999 | 1000000

(1 rows)
cqlsh:demodb> exit
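For anyone who wants to reproduce this setup, the demodb keyspace and emp table shown above could be seeded with CQL along the following lines. This is only a sketch; the column types are assumptions on my part, since only the query output appears above:

$ kubectl exec -n cassandra cassandra-0 -- cqlsh -e "
    CREATE KEYSPACE IF NOT EXISTS demodb
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
    CREATE TABLE IF NOT EXISTS demodb.emp (
      emp_id int PRIMARY KEY, emp_city text, emp_name text, emp_phone varint, emp_sal varint);
    INSERT INTO demodb.emp (emp_id, emp_city, emp_name, emp_phone, emp_sal)
      VALUES (100, 'Cork', 'Cormac', 999, 1000000);"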
Take a backup
Let’s kick off a backup which only backs up objects related to the Cassandra namespace:
$ velero backup create cassandra-tkg-cass-snap --include-namespaces cassandra --snapshot-volumes
Backup request "cassandra-tkg-cass-snap" submitted successfully.
Run `velero backup describe cassandra-tkg-cass-snap` or `velero backup logs cassandra-tkg-cass-snap` for more details.

$ velero backup describe cassandra-tkg-cass-snap
Name:         cassandra-tkg-cass-snap
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.18.10+vmware.1
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=18

Phase:  InProgress

Errors:    0
Warnings:  0

Namespaces:
  Included:  cassandra
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  true

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:     2021-03-08 09:21:27 +0000 GMT
Completed:   <n/a>

Expiration:  2021-04-07 10:21:27 +0100 IST

Estimated total items to be backed up:  15
Items backed up so far:                 0

Velero-Native Snapshots: <none included>
This should trigger a snapshot operation within the TKG cluster for each of the 3 volumes used by this Cassandra StatefulSet.
$ kubectl get snapshots -n cassandra
NAMESPACE   NAME                                        AGE
cassandra   snap-4f4b232e-0f1b-4bb9-a459-62d9251f2d10   44s
cassandra   snap-a255ce8b-7017-48a8-8611-1a46c796093f   34s
cassandra   snap-f7e58144-2bda-4830-b887-1e61745187e4   39s
And after some time, the backup will show up as completed.
$ velero backup get cassandra-tkg-cass-snap
NAME                      STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
cassandra-tkg-cass-snap   Completed   0        0          2021-03-08 09:21:27 +0000 GMT   30d       default            <none>
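As an aside, the EXPIRES column above reflects the default 30-day TTL that Velero applies to a backup. If I wanted to protect this namespace on a recurring basis with a different retention, a scheduled backup could be created along these lines (the schedule name, cron expression and TTL are purely examples, not something from my setup):

$ velero schedule create cassandra-tkg-daily \
    --schedule "0 1 * * *" \
    --include-namespaces cassandra \
    --snapshot-volumes \
    --ttl 168h0m0s

For now though, let's stick with the one-off backup created above.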
However, even though the backup operation shows a status of Completed, this may not be the full picture. The status does not take into account whether or not the snapshot data has been uploaded to the S3 object store bucket. In order to check that, we will have to change contexts from the TKG “guest” cluster back to the Supervisor cluster. From there we can check the status of the snapshots as well as the uploads. As mentioned in the previous posts, the snapshots appear in the Supervisor namespace where the TKG cluster is deployed (cormac-ns in my case), while the uploads appear in the namespace where Velero was installed on the Supervisor (velero-ns in my case):
$ kubectl-vsphere login --vsphere-username administrator@vsphere.local \
    --server=https://20.0.0.1 --insecure-skip-tls-verify

Password: *****
Logged in successfully.

You have access to the following contexts:
   20.0.0.1
   cormac-ns
   minio-domain-c8
   velero-ns
   velero-vsphere-domain-c8

If the context you wish to use is not in this list, you may need to try
logging in again later, or contact your cluster administrator.

To change context, use `kubectl config use-context <workload name>`

$ kubectl get snapshots -n cormac-ns
NAMESPACE   NAME                                        AGE
cormac-ns   snap-cfda8beb-c38a-4406-afb3-0f50787b4e8b   93s
cormac-ns   snap-f53b2a12-ff8c-4f84-802f-ebfdf576c3b8   88s
cormac-ns   snap-f845ad16-f170-4b5e-917e-0d4eab98cf83   99s

$ kubectl get uploads -n velero-ns
NAMESPACE   NAME                                          AGE
velero-ns   upload-057d9fbc-d82b-4239-aa6a-a74f4dff0c99   97s
velero-ns   upload-405099b3-8e1b-4dff-9bbd-4251955608f3   92s
velero-ns   upload-959f62cb-0d97-44e7-aac0-8eb7b205796f   103s
You can use the describe command to monitor the Phase field to see how far they have progressed:
$ kubectl describe snapshots -n cormac-ns \
snap-cfda8beb-c38a-4406-afb3-0f50787b4e8b \
snap-f53b2a12-ff8c-4f84-802f-ebfdf576c3b8 \
snap-f845ad16-f170-4b5e-917e-0d4eab98cf83 | grep -i phase
f:phase:
Phase: Uploaded
f:phase:
Phase: Uploaded
f:phase:
Phase: Uploaded
$ kubectl describe uploads -n velero-ns \
upload-057d9fbc-d82b-4239-aa6a-a74f4dff0c99 \
upload-405099b3-8e1b-4dff-9bbd-4251955608f3 \
upload-959f62cb-0d97-44e7-aac0-8eb7b205796f | grep -i phase
f:phase:
Phase: New
f:phase:
Phase: Completed
f:phase:
Phase: Completed
$ kubectl describe uploads -n velero-ns \
upload-057d9fbc-d82b-4239-aa6a-a74f4dff0c99 \
upload-405099b3-8e1b-4dff-9bbd-4251955608f3 \
upload-959f62cb-0d97-44e7-aac0-8eb7b205796f | grep -i phase
f:phase:
Phase: Completed
f:phase:
Phase: Completed
f:phase:
Phase: Completed
A status of New means the upload has not yet completed. (The f:phase: lines in the output come from the objects' managedFields metadata and can be ignored; it is the Phase: value that we are interested in.) Once all of the uploads show Completed, the snapshot data is stored in the S3 Object Store bucket and the backup is truly complete.
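If you prefer not to eyeball the describe output, the phase can also be pulled straight from the custom resources with a jsonpath query. A quick sketch, assuming the Upload objects expose a status.phase field (which the values above suggest they do), would be:

$ kubectl get uploads -n velero-ns \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'

This should print each upload name alongside its current phase, which is handy for watching the progression from New to Completed.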
Delete previously backed up objects
Now that the backup is taken, let’s see if I can restore it. Before doing that, I am going to remove the Cassandra objects that I just backed up: the StatefulSet, its PVCs, and finally the namespace itself.
$ kubectl get all -n cassandra
NAME              READY   STATUS    RESTARTS   AGE
pod/cassandra-0   1/1     Running   0          3d
pod/cassandra-1   1/1     Running   0          3d
pod/cassandra-2   1/1     Running   0          3d

NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/cassandra   ClusterIP   10.107.129.124   <none>        9042/TCP   3d1h

NAME                         READY   AGE
statefulset.apps/cassandra   3/3     3d

$ kubectl get pvc -n cassandra
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
cassandra-data-cassandra-0   Bound    pvc-86599488-79ae-46a8-a6b4-c04c425d8ae0   1Gi        RWO            vsan-default-storage-policy   3d1h
cassandra-data-cassandra-1   Bound    pvc-3b8aee0e-2341-42f2-aa8b-fc3e0345fac7   1Gi        RWO            vsan-default-storage-policy   3d
cassandra-data-cassandra-2   Bound    pvc-8ae40ba4-6650-4872-a2bc-d3877909751f   1Gi        RWO            vsan-default-storage-policy   3d

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS                  REASON   AGE
pvc-2ebbd21d-34b8-4287-a68f-747407506636   5Gi        RWO            Delete           Bound    default/block-pvc                      vsan-default-storage-policy            40d
pvc-3b8aee0e-2341-42f2-aa8b-fc3e0345fac7   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-1   vsan-default-storage-policy            3d
pvc-86599488-79ae-46a8-a6b4-c04c425d8ae0   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-0   vsan-default-storage-policy            3d1h
pvc-8ae40ba4-6650-4872-a2bc-d3877909751f   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-2   vsan-default-storage-policy            3d

$ kubectl delete statefulset.apps/cassandra -n cassandra
statefulset.apps "cassandra" deleted

$ kubectl delete pvc cassandra-data-cassandra-0 cassandra-data-cassandra-1 cassandra-data-cassandra-2 -n cassandra
persistentvolumeclaim "cassandra-data-cassandra-0" deleted
persistentvolumeclaim "cassandra-data-cassandra-1" deleted
persistentvolumeclaim "cassandra-data-cassandra-2" deleted

$ kubectl get all -n cassandra
NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/cassandra   ClusterIP   10.107.129.124   <none>        9042/TCP   3d1h

$ kubectl get pvc -n cassandra
No resources found in cassandra namespace.

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM               STORAGECLASS                  REASON   AGE
pvc-2ebbd21d-34b8-4287-a68f-747407506636   5Gi        RWO            Delete           Bound    default/block-pvc   vsan-default-storage-policy            40d

$ kubectl delete ns cassandra
namespace "cassandra" deleted
Restore the backup
Now we can try to restore the previous backup taken of my Cassandra namespace.
$ velero backup get cassandra-tkg-cass-snap
NAME                      STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
cassandra-tkg-cass-snap   Completed   0        0          2021-03-08 09:21:27 +0000 GMT   29d       default            <none>

$ velero restore create cassandra-tkg-cass-snap-restore --from-backup cassandra-tkg-cass-snap
Restore request "cassandra-tkg-cass-snap-restore" submitted successfully.
Run `velero restore describe cassandra-tkg-cass-snap-restore` or `velero restore logs cassandra-tkg-cass-snap-restore` for more details.

$ velero restore get
NAME                              BACKUP                    STATUS       STARTED                         COMPLETED   ERRORS   WARNINGS   CREATED                         SELECTOR
cassandra-tkg-cass-snap-restore   cassandra-tkg-cass-snap   InProgress   2021-03-08 12:22:21 +0000 GMT   <nil>       0        0          2021-03-08 12:22:11 +0000 GMT   <none>

$ velero restore describe cassandra-tkg-cass-snap-restore
Name:         cassandra-tkg-cass-snap-restore
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  InProgress

Started:    2021-03-08 12:22:21 +0000 GMT
Completed:  <n/a>

Backup:  cassandra-tkg-cass-snap

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto
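As an aside, since the backup contains the entire cassandra namespace, Velero could also restore it under a different namespace name using a namespace mapping. A hedged example (the target namespace cassandra-restored is just a name I made up for illustration) would look like this:

$ velero restore create cassandra-tkg-cass-snap-restore-alt \
    --from-backup cassandra-tkg-cass-snap \
    --namespace-mappings cassandra:cassandra-restored

In my case, though, I am simply restoring back into the original cassandra namespace, which was deleted above.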
Now, for the restore to complete, the snapshot data that was previously uploaded to the S3 object store bucket must be downloaded again. This can once again be observed by switching contexts to the Supervisor cluster and looking in the Velero namespace there (in my setup, called velero-ns). Note that these downloads are done sequentially rather than in parallel, so we will have to wait for all 3 snapshots to be downloaded and restored, one for each volume in the Cassandra StatefulSet. Again, using the describe option, we can monitor for the download Phase to change to Completed.
$ kubectl get downloads -n velero-ns
NAME                                                                                  AGE
download-057d9fbc-d82b-4239-aa6a-a74f4dff0c99-e84fe5e8-4662-403c-8c0a-79371cfb4a87   93s
download-959f62cb-0d97-44e7-aac0-8eb7b205796f-00c3d228-d2af-441f-a803-e26a58ec98a8   2m6s

$ kubectl get downloads -n velero-ns
NAME                                                                                  AGE
download-057d9fbc-d82b-4239-aa6a-a74f4dff0c99-e84fe5e8-4662-403c-8c0a-79371cfb4a87   12m
download-405099b3-8e1b-4dff-9bbd-4251955608f3-c10d2608-03e4-4e09-92ed-8150216c7aba   4m2s
download-959f62cb-0d97-44e7-aac0-8eb7b205796f-00c3d228-d2af-441f-a803-e26a58ec98a8   12m

$ kubectl describe downloads -n velero-ns | grep -i phase
      f:phase:
  Phase:  Completed
      f:phase:
  Phase:  New
      f:phase:
  Phase:  Completed

$ kubectl describe downloads -n velero-ns | grep -i phase
      f:phase:
  Phase:  Completed
      f:phase:
  Phase:  Completed
      f:phase:
  Phase:  Completed
Once all of the download Phases show Completed, the restore should also be complete. We can change contexts back to the TKG cluster and see if the Cassandra namespace and all related objects have been successfully restored. Assuming the restore goes well, we can also check the NoSQL database to verify that its contents have been successfully restored.
$ kubectl get ns
NAME                                 STATUS   AGE
cassandra                            Active   25m
cormac-guest-ns                      Active   40d
default                              Active   84d
kube-node-lease                      Active   84d
kube-public                          Active   84d
kube-system                          Active   84d
music-system                         Active   84d
velero                               Active   3d
velero-vsphere-plugin-backupdriver   Active   3d
vmware-system-auth                   Active   84d
vmware-system-cloud-provider         Active   84d
vmware-system-csi                    Active   84d

$ kubectl get all -n cassandra
NAME              READY   STATUS    RESTARTS   AGE
pod/cassandra-0   1/1     Running   0          7m9s
pod/cassandra-1   1/1     Running   3          7m9s
pod/cassandra-2   1/1     Running   3          7m9s

NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/cassandra   ClusterIP   10.107.110.112   <none>        9042/TCP   7m8s

NAME                         READY   AGE
statefulset.apps/cassandra   3/3     7m8s

$ kubectl get pvc -n cassandra
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cassandra-data-cassandra-0   Bound    pvc-3dd9ed26-0f97-408a-84c6-a3e89460abcd   1Gi        RWO                           25
cassandra-data-cassandra-1   Bound    pvc-efbb2795-b0b1-4a8a-aaec-4bcca9118a08   1Gi        RWO                           17m
cassandra-data-cassandra-2   Bound    pvc-59567ae5-edbf-4246-ba5e-094997c26f1a   1Gi        RWO                           7m24s

$ kubectl exec -it pod/cassandra-0 -n cassandra -- nodetool status
Datacenter: DC1-TKG-Demo
========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.2.8   131.18 KiB  32      100.0%            43b4a8c5-9055-4382-8c74-a1d543277742  Rack1-TKG-Demo
UN  192.168.2.9   169.79 KiB  32      100.0%            15890386-5681-48ff-bba4-1d9218f23ed2  Rack1-TKG-Demo
UN  192.168.1.12  162.12 KiB  32      100.0%            9ce483b2-9343-4254-bf04-87d41fac39fa  Rack1-TKG-Demo

$ kubectl exec -it pod/cassandra-0 -n cassandra -- cqlsh
Connected to TKG-Demo at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.9 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh> use demodb;
cqlsh:demodb> select * from emp;

 emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+----------+----------+-----------+---------
    100 |     Cork |   Cormac |       999 | 1000000

(1 rows)
cqlsh:demodb>
This all looks good to me. We have been able to use the Velero vSphere Operator, the Velero Data Manager and the Velero vSphere Plugin to successfully back up and restore objects (using vSphere snapshots) in a TKG Service “guest” cluster on vSphere with Tanzu.
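One final housekeeping note: when a backup such as this one is no longer needed, it can be removed via the velero CLI rather than by deleting objects out of the bucket by hand. My understanding is that the vSphere plugin then takes care of cleaning up the associated snapshot data, though it is worth verifying the bucket contents afterwards:

$ velero backup delete cassandra-tkg-cass-snap --confirm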