A first look at Velero (previously known as Ark)
Those of you who work in the cloud native space will probably be aware of VMware’s acquisition of Heptio back in December 2018. Heptio brings much expertise and a number of products to the table, one of which I was very eager to try: Heptio Velero, previously known as Heptio Ark. Velero provides a means to back up and restore cloud native applications. Interestingly, it appears to capture all of the deployment details, so it can back up the pods (compute), persistent volumes (storage) and services (networking), as well as any other related objects, e.g. StatefulSets. It can back up or restore all objects in your Kubernetes cluster, or you can filter objects by type, namespace, and/or label. More details about how Velero works can be found at heptio.github.io.
Velero can be used with both cloud and on-prem deployments of K8s. The cloud providers (Amazon, Google, etc.) have snapshot providers available for creating snapshots of an application’s Persistent Volumes (PVs). On-prem is a little different. There is a provider available for Portworx; for everybody else, there is the open source “restic” utility, which can be used for snapshotting PVs on other platforms, including vSphere.
Velero also ships with the ability to create/deploy a container-based Minio object store for storing backups. Note that out of the box this is very small, and won’t let you back up very much. I had to add a larger PV to the Minio deployment to make it useful (I’ll show you how I did that shortly). The other issue I encountered with Minio is that the Velero backup/restore client (typically residing in a VM or on your desktop) cannot communicate directly with the Minio object store, which means you can’t look at the logs stored in the object store from the velero client. The workaround is to change the Minio service from a ClusterIP to a NodePort, which provides external access to the Minio object store. From that point, you can update one of the Minio configuration YAML files to tell the client about the external access to the Minio store, after which you can display the logs etc. I’ll show you how to do that as well in this post.
My K8s cluster was deployed via PKS. This also has a nuance (not a major one) which we will see shortly.
1. Creating a larger Minio object store
After doing some very small backups, I quickly ran out of space in the Minio object store. The symptoms were that my backup failed very quickly, and when I queried the podVolumeBackups objects, I saw the following events:
cormac@pks-cli:~$ kubectl get podVolumeBackups -n velero
NAME                     AGE
couchbase-backup-c6bbl   4m
couchbase-backup-lsss6   3m
couchbase-backup-p9j4b   2m
cormac@pks-cli:~$ kubectl describe podVolumeBackups couchbase-backup-c6bbl -n velero
Name:         couchbase-backup-c6bbl
Namespace:    velero
.
.
Status:
  Message:  error running restic backup, stderr=Save(<lock/f476380dcd>) returned error, retrying after 679.652419ms: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
            Save(<lock/f476380dcd>) returned error, retrying after 508.836067ms: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
            .
            .
            Save(<lock/f476380dcd>) returned error, retrying after 27.477605106s: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
            Fatal: unable to create lock in backend: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
            : exit status 1
  Path:
  Phase:        Failed
  Snapshot ID:
Events:       <none>
cormac@pks-cli:~$
Therefore, to do anything meaningful, I modified the Minio deployment so that it had a 10GB PV for storing backups. To do this, I created a new StorageClass and a new PersistentVolumeClaim, and modified the configuration in config/minio/00-minio-deployment.yaml to dynamically provision a PV rather than simply use a directory in the container. Since my deployment is on vSAN and uses the vSphere Cloud Provider (VCP), I can also specify a storage policy for the PV. Note the namespace entry for the PVC: it has to be in the velero namespace, along with everything else in the Velero configuration. The full deployment instructions can be found at heptio.github.io.
1.1. New StorageClass
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: minio-sc
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin
  storagePolicyName: gold
  datastore: vsanDatastore
1.2. New PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-pv-claim-1
  namespace: velero
  annotations:
    volume.beta.kubernetes.io/storage-class: minio-sc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
1.3. config/minio/00-minio-deployment.yaml changes
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  namespace: velero
  name: minio
  labels:
    component: minio
spec:
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        component: minio
    spec:
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: minio-pv-claim-1
        - name: config
          emptyDir: {}
The change here is to switch the “storage” volume from an emptyDir to a Persistent Volume, dynamically provisioned via the PersistentVolumeClaim entry. The StorageClass and PVC YAML can be added to the Minio configuration YAML to make things easier for any additional redeploys. Once the new configuration is in place and you redeploy it, Minio will have a 10GB PV available to place backups, allowing you to do something meaningful.
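For example, if you append the StorageClass and PVC to the deployment YAML as suggested above, redeploying and verifying the new volume should look something like this (a sketch of the commands, not verbatim from my session):

kubectl apply -f config/minio/00-minio-deployment.yaml

# Check that the PVC is Bound and that the Minio pod came back up with the new volume
kubectl get pvc -n velero
kubectl get pods -n velero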
2. Accessing Logs
Any time I tried to display backup logs, it would error and complain about not being able to find Minio, with something like “dial tcp: lookup minio.velero.svc on 127.0.0.1:53: no such host”. This is because Minio is using a ClusterIP and is not available externally. To resolve this, you can edit the service and change the type from ClusterIP to NodePort.
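As an aside, a one-line patch should achieve the same change if you prefer it over the interactive edit shown below (I used kubectl edit myself, so treat this as an untested alternative):

kubectl patch svc minio -n velero -p '{"spec":{"type":"NodePort"}}'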
cormac@pks-cli:~/Velero$ kubectl edit svc minio -n velero
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"component":"minio"},"name":"minio","namespace":"velero"},"spec":{"ports":[{"port":9000,"protocol":"TCP","targetPort":9000}],"selector":{"component":"minio"},"type":"ClusterIP"}}
  creationTimestamp: 2019-03-06T11:16:55Z
  labels:
    component: minio
  name: minio
  namespace: velero
  resourceVersion: "1799322"
  selfLink: /api/v1/namespaces/velero/services/minio
  uid: 59c59e95-4001-11e9-a7c8-005056a27540
spec:
  clusterIP: 10.100.200.169
  externalTrafficPolicy: Cluster
  ports:
  - nodePort: 30971
    port: 9000
    protocol: TCP
    targetPort: 9000
  selector:
    component: minio
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}
After making the change, you can see which port Minio is available on externally.
cormac@pks-cli:~/Velero$ kubectl get svc -n velero
NAME    TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
minio   NodePort   10.100.200.169   <none>        9000:32166/TCP   59m
Now the Minio S3 object store is accessible externally. Point a browser at the IP address of a Kubernetes worker node, tag on the port shown above (in this case 32166), and you should get access to the Minio object store interface. Next, we need to change one of the configuration files for Velero/Minio so that the velero client can access it for displaying logs, etc. This file is config/minio/05-backupstoragelocation.yaml. Edit this file, uncomment the last line for the publicUrl, and set it appropriately.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero
  config:
    region: minio
    s3ForcePathStyle: "true"
    s3Url: http://minio.velero.svc:9000
    # Uncomment the following line and provide the value of an externally
    # available URL for downloading logs, running Velero describe, and more.
    publicUrl: http://10.27.51.187:32166
Reapply the configuration using the command kubectl apply -f config/minio/05-backupstoragelocation.yaml. Now you should be able to access the logs using commands such as velero backup logs <backup-name>.
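Before pulling logs, you can quickly confirm that the publicUrl endpoint is reachable from your client (a simple sanity check; any HTTP response means the NodePort is answering):

curl -I http://10.27.51.187:32166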
3. PKS interop – Privileged Containers
This was another issue that stumped me for a while. With PKS, through Pivotal Operations Manager, you create plans for the types of K8s clusters that you deploy. One of the options in a plan is ‘Enable Privileged Containers’. Since this always stated ‘use with caution’, I had always left it disabled. However, with Velero, and specifically with the restic portion (snapshots), I hit a problem when this was disabled. On trying to create a backup, I encountered these events on the restic DaemonSet:
cormac@pks-cli:~/Velero/config/minio$ kubectl describe ds restic -n velero
Name:       restic
Selector:   name=restic
.
.
Events:
  Type     Reason        Age   From                  Message
  ----     ------        ----  ----                  -------
  Warning  FailedCreate  49m   daemonset-controller  Error creating: pods "restic-v567s" is forbidden: pod.Spec.SecurityContext.RunAsUser is forbidden
  Warning  FailedCreate  32m   daemonset-controller  Error creating: pods "restic-xv5v7" is forbidden: pod.Spec.SecurityContext.RunAsUser is forbidden
  Warning  FailedCreate  15m   daemonset-controller  Error creating: pods "restic-vx2cb" is forbidden: pod.Spec.SecurityContext.RunAsUser is forbidden
So I went back to Pivotal Ops Manager, went to the PKS tile, edited the plan, and enabled the checkboxes for both ‘Enable Privileged Containers’ and ‘Disable DenyEscalatingExec’. I had to re-apply the PKS configuration, but once this completed (and I didn’t have to do anything to the K8s cluster, by the way; it is all taken care of by PKS), the restic pods were created successfully. Here are the buttons as they appear in the PKS tile > Plan.
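Once the plan change had rolled out, a quick check along these lines confirms the DaemonSet pods are running (the name=restic selector comes from the describe output above):

kubectl get ds restic -n velero
kubectl get pods -n velero -l name=restic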
OK – we are now ready to take a backup.
4. First Backup (CouchBase StatefulSet)
For my first backup, I wanted to back up a StatefulSet – in my case, a CouchBase deployment that had been scaled out to 3 replicas, meaning 3 pods and 3 PVs. Velero uses annotations to identify components for backup. Here I added a special annotation to the volumes in the pods so that they can be identified by the restic utility for snapshotting. Note that this application is in its own namespace called “couchbase”.
cormac@pks-cli:~/Velero$ kubectl -n couchbase annotate pod/couchbase-0 backup.velero.io/backup-volumes=couchbase-data
pod/couchbase-0 annotated
cormac@pks-cli:~/Velero$ kubectl -n couchbase annotate pod/couchbase-1 backup.velero.io/backup-volumes=couchbase-data
pod/couchbase-1 annotated
cormac@pks-cli:~/Velero$ kubectl -n couchbase annotate pod/couchbase-2 backup.velero.io/backup-volumes=couchbase-data
pod/couchbase-2 annotated
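With a larger number of replicas, a small shell loop would do the same job as the three commands above (equivalent, just less typing):

for i in 0 1 2; do
  kubectl -n couchbase annotate pod/couchbase-$i backup.velero.io/backup-volumes=couchbase-data
done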
Now I can run the backup:
cormac@pks-cli:~/Velero$ velero backup create couchbase
Backup request "couchbase" submitted successfully.
Run `velero backup describe couchbase` or `velero backup logs couchbase` for more details.

cormac@pks-cli:~/Velero$ velero backup describe couchbase --details
Name:         couchbase
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  <none>

Phase:  InProgress

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1

Started:    <n/a>
Completed:  <n/a>

Expiration:  2019-04-05 13:25:25 +0100 IST

Validation errors:  <none>

Persistent Volumes:  <none included>

Restic Backups:
  New:
    couchbase/couchbase-0: couchbase-data
cormac@pks-cli:~/Velero/$ velero backup get
NAME        STATUS      CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
couchbase   Completed   2019-03-06 13:24:27 +0000 GMT   29d       default            <none>
At this point, it is also interesting to login to the Minio portal and see what has been backed up. Here is an overview from the top-most level – velero/restic/couchbase.
5. First Restore (CouchBase StatefulSet)
To do a real restore, let’s delete the couchbase namespace which has all of my application and data.
cormac@pks-cli:~/Velero/tests$ kubectl delete ns couchbase
namespace "couchbase" deleted
cormac@pks-cli:~/Velero/tests$ kubectl get ns
NAME          STATUS   AGE
default       Active   15d
kube-public   Active   15d
kube-system   Active   15d
pks-system    Active   15d
velero        Active   16h
Let’s now try to restore the deleted namespace and all of its contents (pods, PVs, etc.). The more observant among you will notice that the backup name I am using is “cb2”, not “couchbase”. That is only because I did a few different backup tests, and cb2 is the one I am restoring. You will specify your own backup name here.
cormac@pks-cli:~/Velero/tests$ velero restore create --from-backup cb2
Restore request "cb2-20190307090808" submitted successfully.
Run `velero restore describe cb2-20190307090808` or `velero restore logs cb2-20190307090808` for more details.
cormac@pks-cli:~/Velero/tests$ velero restore describe cb2-20190307090808
Name:         cb2-20190307090808
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Backup:  cb2

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.ark.heptio.com, backups.velero.io, restores.ark.heptio.com, restores.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Phase:  InProgress

Validation errors:  <none>

Warnings:  <none>
Errors:    <none>
After a few moments, the restore completes. I can also see that all 3 of my PVs are referenced. The next step is to verify that the application has indeed recovered and is usable.
cormac@pks-cli:~/Velero/tests$ velero restore describe cb2-20190307090808 --details
Name:         cb2-20190307090808
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Backup:  cb2

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.ark.heptio.com, backups.velero.io, restores.ark.heptio.com, restores.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Phase:  Completed

Validation errors:  <none>

Warnings:  <none>
Errors:    <none>

Restic Restores:
  Completed:
    couchbase/couchbase-0: couchbase-data
    couchbase/couchbase-1: couchbase-data
    couchbase/couchbase-2: couchbase-data
Everything appears to have come back successfully.
cormac@pks-cli:~/Velero/tests$ kubectl get ns
NAME          STATUS   AGE
couchbase     Active   70s
default       Active   15d
kube-public   Active   15d
kube-system   Active   15d
pks-system    Active   15d
velero        Active   16h

cormac@pks-cli:~/Velero/tests$ kubectl get pods -n couchbase
NAME          READY   STATUS    RESTARTS   AGE
couchbase-0   1/1     Running   0          90s
couchbase-1   1/1     Running   0          90s
couchbase-2   1/1     Running   0          90s

cormac@pks-cli:~/Velero/tests$ kubectl get pv -n couchbase
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS   REASON   AGE
pvc-8d676074-4032-11e9-842d-005056a27540   10Gi       RWO            Delete           Bound    velero/minio-pv-claim-1                minio-sc                16h
pvc-9d9c537b-40b8-11e9-842d-005056a27540   1Gi        RWO            Delete           Bound    couchbase/couchbase-data-couchbase-0   couchbasesc             95s
pvc-9d9d7846-40b8-11e9-842d-005056a27540   1Gi        RWO            Delete           Bound    couchbase/couchbase-data-couchbase-1   couchbasesc             95s
pvc-9d9f60b9-40b8-11e9-842d-005056a27540   1Gi        RWO            Delete           Bound    couchbase/couchbase-data-couchbase-2   couchbasesc             91s

cormac@pks-cli:~/Velero/tests$ kubectl get svc -n couchbase
NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
couchbase      ClusterIP   None            <none>        8091/TCP         117s
couchbase-ui   NodePort    10.100.200.80   <pending>     8091:31365/TCP   117s
I was also able to successfully connect to my CouchBase UI and examine the contents – all looked good to me. To close, I’d like to give a big shout-out to my colleague Myles Gray, who fielded a lot of my K8s related questions. Cheers Myles!
[Update] OK, so I am not sure why it worked for me when I did my initial blog, but further attempts to restore this configuration resulted in CouchBase not starting. The reason seems to be that the IP addresses allocated to the CouchBase pods are hard-coded in either /opt/couchbase/var/lib/couchbase/ip or /opt/couchbase/var/lib/couchbase/ip_start on the pods. To get CouchBase back up and running, I had to replace the old entries (the IP addresses of the nodes that were backed up) with the new IP addresses of the restored pods. I ssh’ed onto the pods to do this. You can get the new IP address from /etc/hosts on each pod. Once the new entries were updated, CouchBase restarted. However, it did not retain the original configuration, and seems to have reset to a default deployment. I’ll continue to research what else may need to be changed to bring the original config back.
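For anyone hitting the same issue, the manual fix per pod looked roughly like the sketch below. I used ssh, but kubectl exec should do the same thing; this assumes a shell is available in the CouchBase container, NEW_IP is a placeholder for the restored pod’s address, and which of the two files you edit depends on which one exists in your image:

# Find the restored pod's IP address
kubectl -n couchbase exec couchbase-0 -- cat /etc/hosts

# Overwrite the hard-coded address (use ip_start instead if that is the file present)
kubectl -n couchbase exec couchbase-0 -- sh -c 'echo NEW_IP > /opt/couchbase/var/lib/couchbase/ip'

# Repeat for couchbase-1 and couchbase-2, then let CouchBase restart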