Velero vSphere Operator backup/restore of TKG “guest” cluster objects in vSphere with Tanzu

Over the past week or so, I have posted a number of blogs on how to get started with the new Velero vSphere Operator. I showed how to deploy the Operator in the Supervisor Cluster of vSphere with Tanzu, and also how to install the Velero and Backupdriver components in the Supervisor. We then went on to take backups and do restores of both stateless (e.g. an Nginx deployment) and stateful (e.g. a Cassandra StatefulSet) applications which were running as PodVMs in a Supervisor cluster. In the latter post, we saw how the new Velero Data Manager acted as the interface between Velero, vSphere and the backup destination. It contains both snapshot and data mover capabilities to move the data to the Velero backup destination, in our case a MinIO S3 Object Store bucket. I want to complete this series of blog posts with one final example: how to back up and restore a stateful application from within a Tanzu Kubernetes Grid (TKG) “guest” cluster. This TKG cluster has been deployed by the TKG Service (TKGS) to a namespace on vSphere with Tanzu. Within that TKG cluster, I will run a Cassandra StatefulSet application, then back it up, delete it and restore it. My backup (and restore) destination is the same S3 object store bucket that I used when backing up objects in the Supervisor cluster.

Requirements

We have already implemented most of the requirements for backing up TKG objects. The destination S3 Object Storage bucket is already in place. In my case, I used a new feature available in VMware Cloud Foundation version 4.2, the vSAN Data Persistence platform (DPp), to deploy a MinIO S3 Object Store and bucket to act as the backup destination. As mentioned, I have already deployed the Velero vSphere Operator on the vSphere with Tanzu Supervisor Cluster, as well as the Velero Data Manager. In fact, the only step that remains is to deploy the Velero and Backupdriver components in the TKG “guest” cluster. We did this previously for the Supervisor cluster, but now we must do it for the TKG guest cluster. Let’s do that next.
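
One point worth calling out before the install: the velero install command in the next section must be run while kubectl is pointed at the TKG guest cluster itself, not the Supervisor cluster. A minimal login sketch is shown below, assuming the TKG cluster tkg-cluster-vcf-w-tanzu (the cluster name appears in the context list later in this post) was provisioned in the cormac-ns Supervisor namespace; the namespace is an assumption on my part.

$ kubectl-vsphere login --server=https://20.0.0.1 \
--vsphere-username administrator@vsphere.local \
--insecure-skip-tls-verify \
--tanzu-kubernetes-cluster-namespace cormac-ns \
--tanzu-kubernetes-cluster-name tkg-cluster-vcf-w-tanzu

$ kubectl config use-context tkg-cluster-vcf-w-tanzu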

Installation

The installation of Velero and the Backup driver is much the same as a standard deployment on vanilla Kubernetes, with the only difference being that you must include the vSphere plugin. This is so Velero knows to use vSphere snapshots (rather than a restic file copy, for example). This blog site has many examples of deploying Velero, and there is even an example of how to deploy the vSphere plugin from last year here. Note that this install command is run from the context of the TKG guest cluster. For the purposes of keeping everything in one place, here is an example install command:

./velero install \
--provider aws \
--bucket backup-bucket \
--secret-file ./velero-minio-credentials \
--namespace velero \
--plugins "velero/velero-plugin-for-aws:v1.1.0","vsphereveleroplugin/velero-plugin-for-vsphere:1.1.0" \
--backup-location-config region=minio,s3ForcePathStyle="true",s3Url="http://20.0.0.5",publicUrl="http://20.0.0.5" \
--snapshot-location-config region=minio
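
The --secret-file parameter points to a credentials file containing the MinIO access key and secret key. The contents of my velero-minio-credentials file are not shown in this post, but for the AWS provider plugin it follows the standard AWS credentials format, something along these lines (the values below are placeholders):

[default]
aws_access_key_id = <MINIO-ACCESS-KEY>
aws_secret_access_key = <MINIO-SECRET-KEY>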

Many of these entries will be familiar from when we did the install in the Supervisor cluster; most of the parameters relate to the MinIO S3 Object Storage and bucket which we are using as the backup destination. There are 2 plugins listed: one to utilize the S3 object store (aws) and the other to utilize vSphere snapshots (vsphere). Once the installer has run, there should be 2 deployment objects (as well as 2 ReplicaSets and 2 Pods) in the velero namespace in the TKG guest cluster. One is for the Velero server component, whilst the other is for the backupdriver. Let’s take a closer look.

$ kubectl config get-contexts
CURRENT   NAME                       CLUSTER    AUTHINFO                                   NAMESPACE
          20.0.0.1                   20.0.0.1   wcp:20.0.0.1:administrator@vsphere.local
          cormac-ns                  20.0.0.1   wcp:20.0.0.1:administrator@vsphere.local   cormac-ns
          minio-domain-c8            20.0.0.1   wcp:20.0.0.1:administrator@vsphere.local   minio-domain-c8
*         tkg-cluster-vcf-w-tanzu    20.0.0.3   wcp:20.0.0.3:administrator@vsphere.local
          velero-ns                  20.0.0.1   wcp:20.0.0.1:administrator@vsphere.local   velero-ns
          velero-vsphere-domain-c8   20.0.0.1   wcp:20.0.0.1:administrator@vsphere.local   velero-vsphere-domain-c8


$ kubectl get ns
NAME                                 STATUS   AGE
cassandra                            Active   3d
cormac-guest-ns                      Active   40d
default                              Active   84d
kube-node-lease                      Active   84d
kube-public                          Active   84d
kube-system                          Active   84d
music-system                         Active   83d
velero                               Active   10m
velero-vsphere-plugin-backupdriver   Active   10m
vmware-system-auth                   Active   84d
vmware-system-cloud-provider         Active   84d
vmware-system-csi                    Active   84d

$ kubectl get all -n velero
NAME                                 READY   STATUS    RESTARTS   AGE
pod/backup-driver-5567f6ccfd-lzp76   1/1     Running   0          2d23h
pod/velero-8469fd4f65-gs4kv          1/1     Running   0          2d23h

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/backup-driver   1/1     1            1           2d23h
deployment.apps/velero          1/1     1            1           2d23h

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/backup-driver-5567f6ccfd   1         1         1       2d23h
replicaset.apps/velero-8469fd4f65          1         1         1       2d23h

Prep Cassandra for backup

Let’s add some entries to my Cassandra NoSQL database so that when we try a restore, there is something to check (if you want to know how to populate Cassandra, search this site for cqlsh examples; a sketch of the kind of CQL involved is also shown after the session output below):

$ kubectl exec -it -n cassandra pod/cassandra-2 -- nodetool status
Datacenter: DC1-TKG-Demo
========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns (effective)  Host ID                               Rack
UN  192.168.1.4  137.32 KiB  32           100.0%            15890386-5681-48ff-bba4-1d9218f23ed2  Rack1-TKG-Demo
UN  192.168.2.6  141.32 KiB  32           100.0%            43b4a8c5-9055-4382-8c74-a1d543277742  Rack1-TKG-Demo
UN  192.168.2.7  133.92 KiB  32           100.0%            9ce483b2-9343-4254-bf04-87d41fac39fa  Rack1-TKG-Demo


$ kubectl exec -it -n cassandra pod/cassandra-2 -- cqlsh
Connected to TKG-Demo at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.9 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh> use demodb;
cqlsh:demodb> select * from emp;
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+----------+----------+-----------+---------
    100 |     Cork |   Cormac |       999 | 1000000
(1 rows)

cqlsh:demodb> exit
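
For reference, the demodb keyspace and emp table seen above could have been created and populated with CQL along the following lines. This is only a sketch: the column types and replication settings are assumptions on my part, as the exact schema is not shown in this post.

cqlsh> CREATE KEYSPACE demodb WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
cqlsh> USE demodb;
cqlsh:demodb> CREATE TABLE emp (emp_id int PRIMARY KEY, emp_name text, emp_city text, emp_sal varint, emp_phone varint);
cqlsh:demodb> INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES (100, 'Cormac', 'Cork', 999, 1000000);
cqlsh:demodb> SELECT * FROM emp;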

Take a backup

Let’s kick off a backup which only backs up objects related to the Cassandra namespace:

$ velero backup create cassandra-tkg-cass-snap  --include-namespaces cassandra --snapshot-volumes
Backup request "cassandra-tkg-cass-snap" submitted successfully.
Run `velero backup describe cassandra-tkg-cass-snap` or `velero backup logs cassandra-tkg-cass-snap` for more details.


$ velero backup describe cassandra-tkg-cass-snap
Name:         cassandra-tkg-cass-snap
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.18.10+vmware.1
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=18

Phase:  InProgress

Errors:    0
Warnings:  0

Namespaces:
  Included:  cassandra
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  true

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2021-03-08 09:21:27 +0000 GMT
Completed:  <n/a>

Expiration:  2021-04-07 10:21:27 +0100 IST

Estimated total items to be backed up:  15
Items backed up so far:                 0

Velero-Native Snapshots: <none included>

This should trigger a snapshot operation within the TKG cluster for each of the 3 volumes used by this Cassandra StatefulSet.

$ kubectl get snapshots -n cassandra
NAMESPACE   NAME                                        AGE
cassandra   snap-4f4b232e-0f1b-4bb9-a459-62d9251f2d10   44s
cassandra   snap-a255ce8b-7017-48a8-8611-1a46c796093f   34s
cassandra   snap-f7e58144-2bda-4830-b887-1e61745187e4   39s

And after some time, the backup will show up as completed.

$ velero backup get cassandra-tkg-cass-snap
NAME                         STATUS            ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
cassandra-tkg-cass-snap      Completed         0        0          2021-03-08 09:21:27 +0000 GMT   30d       default            <none>

However, even though the status of the backup operation shows up as Completed, the backup may not actually be finished. The status does not take into account whether or not the snapshot data has been uploaded to the S3 object store bucket. To check that, we have to change contexts from the TKG “guest” cluster back to the Supervisor cluster. From there we can check the status of both the snapshots and the uploads. As mentioned in the previous posts, the snapshots appear in the Supervisor namespace where the TKG cluster was deployed (cormac-ns in my case), and the uploads appear in the velero namespace (velero-ns) in the Supervisor cluster context:

$ kubectl-vsphere login --vsphere-username administrator@vsphere.local \
--server=https://20.0.0.1 --insecure-skip-tls-verify

Password: *****
Logged in successfully.

You have access to the following contexts:
   20.0.0.1
   cormac-ns
   minio-domain-c8
   velero-ns
   velero-vsphere-domain-c8

If the context you wish to use is not in this list, you may need to try
logging in again later, or contact your cluster administrator.

To change context, use `kubectl config use-context <workload name>`
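
For example, to query the Supervisor cluster objects shown below, any of the Supervisor contexts listed above will do; I switch to the main Supervisor context here:

$ kubectl config use-context 20.0.0.1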


$ kubectl get snapshots -n cormac-ns
NAMESPACE   NAME                                        AGE
cormac-ns   snap-cfda8beb-c38a-4406-afb3-0f50787b4e8b   93s
cormac-ns   snap-f53b2a12-ff8c-4f84-802f-ebfdf576c3b8   88s
cormac-ns   snap-f845ad16-f170-4b5e-917e-0d4eab98cf83   99s


$ kubectl get uploads -n velero-ns
NAMESPACE   NAME                                          AGE
velero-ns   upload-057d9fbc-d82b-4239-aa6a-a74f4dff0c99   97s
velero-ns   upload-405099b3-8e1b-4dff-9bbd-4251955608f3   92s
velero-ns   upload-959f62cb-0d97-44e7-aac0-8eb7b205796f   103s

You can use the describe command to monitor the Phase field to see how far they have progressed:

$ kubectl describe snapshots -n cormac-ns \
snap-cfda8beb-c38a-4406-afb3-0f50787b4e8b \
snap-f53b2a12-ff8c-4f84-802f-ebfdf576c3b8 \
snap-f845ad16-f170-4b5e-917e-0d4eab98cf83 | grep -i phase
        f:phase:
  Phase:                 Uploaded
        f:phase:
  Phase:                 Uploaded
        f:phase:
  Phase:                 Uploaded

$ kubectl describe uploads -n velero-ns \
upload-057d9fbc-d82b-4239-aa6a-a74f4dff0c99 \
upload-405099b3-8e1b-4dff-9bbd-4251955608f3 \
upload-959f62cb-0d97-44e7-aac0-8eb7b205796f | grep -i phase
        f:phase:
  Phase:                 New
        f:phase:
  Phase:                 Completed
        f:phase:
  Phase:                 Completed

$ kubectl describe uploads -n velero-ns \
upload-057d9fbc-d82b-4239-aa6a-a74f4dff0c99 \
upload-405099b3-8e1b-4dff-9bbd-4251955608f3 \
upload-959f62cb-0d97-44e7-aac0-8eb7b205796f | grep -i phase
        f:phase:
  Phase:                 Completed
        f:phase:
  Phase:                 Completed
        f:phase:
  Phase:                 Completed
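
Rather than grepping the describe output, a more concise view of the same information can be had with custom columns. This assumes the Upload custom resources expose their phase under .status.phase, which the describe output above suggests; the same approach works for the snapshots:

$ kubectl get uploads -n velero-ns -o custom-columns=NAME:.metadata.name,PHASE:.status.phase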

A Phase of New means the upload has not yet been processed. Once all uploads show Completed, the snapshot data for each volume is stored in the S3 Object Store bucket and the backup is truly complete.
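
As an optional sanity check, you can also list the contents of the backup bucket directly on the MinIO S3 object store. The example below is an assumption on my part: it requires the aws CLI to be installed and configured with the same MinIO access key and secret key that Velero uses, and it is not a step the plugin requires.

$ aws --endpoint-url=http://20.0.0.5 s3 ls s3://backup-bucket/ --recursive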

Delete previously backed up objects

Now that the backup has been taken, let’s see if I can restore it. Before doing that, I am going to remove the Cassandra objects that I just backed up: the StatefulSet, its PVCs, and finally the cassandra namespace itself.

$ kubectl get all -n cassandra
NAME              READY   STATUS    RESTARTS   AGE
pod/cassandra-0   1/1     Running   0          3d
pod/cassandra-1   1/1     Running   0          3d
pod/cassandra-2   1/1     Running   0          3d

NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/cassandra   ClusterIP   10.107.129.124   <none>        9042/TCP   3d1h

NAME                         READY   AGE
statefulset.apps/cassandra   3/3     3d

$ kubectl get pvc -n cassandra
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
cassandra-data-cassandra-0   Bound    pvc-86599488-79ae-46a8-a6b4-c04c425d8ae0   1Gi        RWO            vsan-default-storage-policy   3d1h
cassandra-data-cassandra-1   Bound    pvc-3b8aee0e-2341-42f2-aa8b-fc3e0345fac7   1Gi        RWO            vsan-default-storage-policy   3d
cassandra-data-cassandra-2   Bound    pvc-8ae40ba4-6650-4872-a2bc-d3877909751f   1Gi        RWO            vsan-default-storage-policy   3d

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS                  REASON   AGE
pvc-2ebbd21d-34b8-4287-a68f-747407506636   5Gi        RWO            Delete           Bound    default/block-pvc                      vsan-default-storage-policy            40d
pvc-3b8aee0e-2341-42f2-aa8b-fc3e0345fac7   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-1   vsan-default-storage-policy            3d
pvc-86599488-79ae-46a8-a6b4-c04c425d8ae0   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-0   vsan-default-storage-policy            3d1h
pvc-8ae40ba4-6650-4872-a2bc-d3877909751f   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-2   vsan-default-storage-policy            3d

$ kubectl delete statefulset.apps/cassandra -n cassandra
statefulset.apps "cassandra" deleted

$ kubectl delete pvc cassandra-data-cassandra-0 cassandra-data-cassandra-1 cassandra-data-cassandra-2 -n cassandra
persistentvolumeclaim "cassandra-data-cassandra-0" deleted
persistentvolumeclaim "cassandra-data-cassandra-1" deleted
persistentvolumeclaim "cassandra-data-cassandra-2" deleted

$ kubectl get all -n cassandra
NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/cassandra   ClusterIP   10.107.129.124   <none>        9042/TCP   3d1h

$ kubectl get pvc -n cassandra
No resources found in cassandra namespace.

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM               STORAGECLASS                  REASON   AGE
pvc-2ebbd21d-34b8-4287-a68f-747407506636   5Gi        RWO            Delete           Bound    default/block-pvc   vsan-default-storage-policy            40d

$ kubectl delete ns cassandra
namespace "cassandra” deleted

Restore the backup

Now we can try to restore the previous backup taken of my Cassandra namespace.

$ velero backup get cassandra-tkg-cass-snap
NAME                      STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
cassandra-tkg-cass-snap   Completed   0        0          2021-03-08 09:21:27 +0000 GMT   29d       default            <none>

$ velero restore create cassandra-tkg-cass-snap-restore --from-backup cassandra-tkg-cass-snap
Restore request "cassandra-tkg-cass-snap-restore" submitted successfully.
Run `velero restore describe cassandra-tkg-cass-snap-restore` or `velero restore logs cassandra-tkg-cass-snap-restore` for more details.

$ velero restore get
NAME                              BACKUP                    STATUS       STARTED                         COMPLETED   ERRORS   WARNINGS   CREATED                         SELECTOR
cassandra-tkg-cass-snap-restore   cassandra-tkg-cass-snap   InProgress   2021-03-08 12:22:21 +0000 GMT   <nil>       0        0          2021-03-08 12:22:11 +0000 GMT   <none>

$ velero restore describe cassandra-tkg-cass-snap-restore
Name:         cassandra-tkg-cass-snap-restore
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  InProgress

Started:    2021-03-08 12:22:21 +0000 GMT
Completed:  <n/a>

Backup:  cassandra-tkg-cass-snap

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Now, for the restore to complete, the snapshot data that was previously uploaded to the S3 object store bucket must be downloaded again. This can once again be observed from the velero namespace in the Supervisor cluster (velero-ns in my setup) by switching contexts. Note that these downloads are done sequentially rather than in parallel, so we will have to wait for all 3 to be downloaded and restored, one for each volume in the Cassandra StatefulSet. Again, using the describe option, we can monitor for each download Phase to change to Completed.

$ kubectl get downloads -n velero-ns
NAME                                                                                 AGE
download-057d9fbc-d82b-4239-aa6a-a74f4dff0c99-e84fe5e8-4662-403c-8c0a-79371cfb4a87   93s
download-959f62cb-0d97-44e7-aac0-8eb7b205796f-00c3d228-d2af-441f-a803-e26a58ec98a8   2m6s


$ kubectl get downloads -n velero-ns
NAME                                                                                 AGE
download-057d9fbc-d82b-4239-aa6a-a74f4dff0c99-e84fe5e8-4662-403c-8c0a-79371cfb4a87   12m
download-405099b3-8e1b-4dff-9bbd-4251955608f3-c10d2608-03e4-4e09-92ed-8150216c7aba   4m2s
download-959f62cb-0d97-44e7-aac0-8eb7b205796f-00c3d228-d2af-441f-a803-e26a58ec98a8   12m

$ kubectl describe downloads -n velero-ns | grep -i phase
        f:phase:
  Phase:                 Completed
        f:phase:
  Phase:                 New
        f:phase:
  Phase:                 Completed

$ kubectl describe downloads -n velero-ns | grep -i phase
        f:phase:
  Phase:                 Completed
        f:phase:
  Phase:                 Completed
        f:phase:
  Phase:                 Completed

Once all download Phases have a status of Completed, the restore should also be complete. We can change contexts back to the TKG cluster and check whether the cassandra namespace and all related objects have been successfully restored. Assuming the restore goes well, we can also check the NoSQL database to verify that its contents have been restored as well.
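
To switch back to the TKG guest cluster, reuse the context name from the kubectl config get-contexts output shown earlier:

$ kubectl config use-context tkg-cluster-vcf-w-tanzu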

$ kubectl get ns
NAME                                 STATUS   AGE
cassandra                            Active   25m
cormac-guest-ns                      Active   40d
default                              Active   84d
kube-node-lease                      Active   84d
kube-public                          Active   84d
kube-system                          Active   84d
music-system                         Active   84d
velero                               Active   3d
velero-vsphere-plugin-backupdriver   Active   3d
vmware-system-auth                   Active   84d
vmware-system-cloud-provider         Active   84d
vmware-system-csi                    Active   84d

$ kubectl get all -n cassandra
NAME              READY   STATUS    RESTARTS   AGE
pod/cassandra-0   1/1     Running   0          7m9s
pod/cassandra-1   1/1     Running   3          7m9s
pod/cassandra-2   1/1     Running   3          7m9s

NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/cassandra   ClusterIP   10.107.110.112   <none>        9042/TCP   7m8s

NAME                         READY   AGE
statefulset.apps/cassandra   3/3     7m8s


$ kubectl get pvc -n cassandra
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cassandra-data-cassandra-0   Bound    pvc-3dd9ed26-0f97-408a-84c6-a3e89460abcd   1Gi        RWO                           25
cassandra-data-cassandra-1   Bound    pvc-efbb2795-b0b1-4a8a-aaec-4bcca9118a08   1Gi        RWO                           17m
cassandra-data-cassandra-2   Bound    pvc-59567ae5-edbf-4246-ba5e-094997c26f1a   1Gi        RWO                           7m24s

$ kubectl exec -it pod/cassandra-0 -n cassandra -- nodetool status
Datacenter: DC1-TKG-Demo
========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  192.168.2.8   131.18 KiB  32           100.0%            43b4a8c5-9055-4382-8c74-a1d543277742  Rack1-TKG-Demo
UN  192.168.2.9   169.79 KiB  32           100.0%            15890386-5681-48ff-bba4-1d9218f23ed2  Rack1-TKG-Demo
UN  192.168.1.12  162.12 KiB  32           100.0%            9ce483b2-9343-4254-bf04-87d41fac39fa  Rack1-TKG-Demo

$ kubectl exec -it pod/cassandra-0 -n cassandra -- cqlsh
Connected to TKG-Demo at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.9 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh> use demodb;
cqlsh:demodb> select * from emp;

emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+----------+----------+-----------+---------
    100 |     Cork |   Cormac |       999 | 1000000
(1 rows)
cqlsh:demodb>

This all looks good to me. We have been able to use the Velero vSphere Operator, the Velero Data Manager and the Velero vSphere Plugin to successfully back up and restore objects (using vSphere snapshots) in a TKG Service “guest” cluster on vSphere with Tanzu.