A first look at Velero (previously known as Ark)

Those of you who work in the cloud native space will probably be aware of VMware’s acquisition of Heptio back in December 2018. Heptio brings much expertise and a number of products to the table, one of which I was very eager to try. This is the Heptio Velero product, previously known as Heptio Ark. Heptio Velero provides a means to back up and restore cloud native applications. Interestingly enough, it appears to capture all of the deployment details, so it can back up the pods (compute), persistent volumes (storage) and services (networking), as well as any other related objects, e.g. StatefulSets. It can back up or restore all objects in your Kubernetes cluster, or you can filter objects by type, namespace, and/or label. More details about how Velero works can be found at heptio.github.io.

Velero can be used with both cloud and on-prem deployments of K8s. The cloud providers (Amazon, Google, etc.) have snapshot providers available for creating snapshots of the application’s Persistent Volumes (PVs). On-prem is a little different. There is a provider available for Portworx. For everybody else, there is the open source “restic” utility, which can be used for snapshotting PVs on other platforms, including vSphere.

Velero also ships with the ability to deploy a container-based Minio object store for storing backups. Note that out of the box this is very small, and won’t let you back up very much. I had to add a larger PV to the Minio deployment to make it useful (I’ll show you how I did that shortly). The other issue I encountered with Minio is that the Velero backup/restore client (typically residing in a VM or on your desktop) cannot communicate directly with the Minio object store. This means you can’t look at the logs stored in the object store from the velero client. A workaround is to change the Minio service from ClusterIP to NodePort, which provides external access to the Minio object store. You can then update one of the Minio configuration YAML files to tell the client about the external address of the Minio store, after which you can display the logs, etc. I’ll show you how to do that as well in the post.

My K8s cluster was deployed via PKS. This also has a nuance (not a major one) which we will see shortly.

1. Creating a larger Minio object store

After doing some very small backups, I quickly ran out of space in the Minio object store. The symptoms were that my backup failed very quickly, and when I queried the podVolumeBackups objects, I saw the following status message:

cormac@pks-cli:~$ kubectl get podVolumeBackups -n velero
NAME                     AGE
couchbase-backup-c6bbl   4m
couchbase-backup-lsss6   3m
couchbase-backup-p9j4b   2m


cormac@pks-cli:~$ kubectl describe podVolumeBackups couchbase-backup-c6bbl -n velero
Name:         couchbase-backup-c6bbl
Namespace:    velero
.
.
Status:
  Message:  error running restic backup, stderr=Save(<lock/f476380dcd>) returned error, retrying after 679.652419ms: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
Save(<lock/f476380dcd>) returned error, retrying after 508.836067ms: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
Save(<lock/f476380dcd>) returned error, retrying after 1.640319969s: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
Save(<lock/f476380dcd>) returned error, retrying after 1.337024499s: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
Save(<lock/f476380dcd>) returned error, retrying after 1.620713255s: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
Save(<lock/f476380dcd>) returned error, retrying after 4.662012875s: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
Save(<lock/f476380dcd>) returned error, retrying after 7.092309877s: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
Save(<lock/f476380dcd>) returned error, retrying after 6.33450427s: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
Save(<lock/f476380dcd>) returned error, retrying after 13.103711682s: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
Save(<lock/f476380dcd>) returned error, retrying after 27.477605106s: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
Fatal: unable to create lock in backend: client.PutObject: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.
: exit status 1
  Path:
  Phase:        Failed
  Snapshot ID:
Events:         <none>
cormac@pks-cli:~$

Therefore, to do anything meaningful, I modified the Minio deployment so that it had a 10GB PV for storing backups. To do this, I created a new StorageClass and a new PersistentVolumeClaim, and then modified the configuration in config/minio/00-minio-deployment.yaml to dynamically provision a PV rather than simply use a directory in the container. Since my deployment is on vSAN and uses the vSphere Cloud Provider (VCP), I can also specify a storage policy for the PV. Note the namespace entry for the PVC. It has to be in the velero namespace, along with everything else in the configuration for Velero. The full deployment instructions can be found at heptio.github.io.

1.1. New StorageClass

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: minio-sc
provisioner: kubernetes.io/vsphere-volume
parameters:
    diskformat: thin
    storagePolicyName: gold
    datastore: vsanDatastore

1.2. New PersistentVolumeClaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-pv-claim-1
  namespace: velero
  annotations:
    volume.beta.kubernetes.io/storage-class: minio-sc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

1.3. config/minio/00-minio-deployment.yaml changes

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  namespace: velero
  name: minio
  labels:
    component: minio
spec:
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        component: minio
    spec:
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: minio-pv-claim-1
      - name: config
        emptyDir: {}

The change here is to switch the volume “storage” from an emptyDir to a Persistent Volume, which is dynamically provisioned via the PersistentVolumeClaim entry. The StorageClass and PVC YAML can be added to the Minio configuration YAML to make any future redeploys easier. Once the new configuration is in place and redeployed, Minio has a 10GB PV available for backups, which allows you to do something meaningful.
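As a rough sketch of the redeploy step, the shell function below applies the three pieces in order and then prints the phase of the claim. The two file names passed in are hypothetical (they stand for wherever you saved the StorageClass and PVC YAML above), and it assumes the velero namespace already exists from the standard prerequisites.

```shell
# Sketch: create the StorageClass and PVC, redeploy Minio, then report the PVC phase.
# $1 and $2 are hypothetical file names holding the StorageClass and PVC YAML.
deploy_minio_storage() {
  sc_file="$1"
  pvc_file="$2"
  kubectl apply -f "$sc_file" &&
  kubectl apply -f "$pvc_file" &&
  kubectl apply -f config/minio/00-minio-deployment.yaml &&
  kubectl get pvc minio-pv-claim-1 -n velero -o jsonpath='{.status.phase}'
}

# Example (hypothetical file names):
#   deploy_minio_storage minio-sc.yaml minio-pvc.yaml
```

The last command should report the claim as Bound once the volume has been provisioned; if it stays Pending, describing the PVC is usually the quickest way to see why.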

2. Accessing Logs

Any time I tried to display backup logs, the command would fail and complain about not being able to find Minio, with a “dial tcp: lookup minio.velero.svc on 127.0.0.1:53: no such host” or something to that effect. This is because the Minio service is of type ClusterIP and is not available externally. To resolve this, you can edit the service and change its type from ClusterIP to NodePort.

cormac@pks-cli:~/Velero$ kubectl edit svc minio -n velero

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"component":"minio"},\
"name":"minio","namespace":"velero"},"spec":{"ports":[{"port":9000,"protocol":"TCP","targetPort":9000}],\
"selector":{"component":"minio"},"type":"ClusterIP"}}
  creationTimestamp: 2019-03-06T11:16:55Z
  labels:
    component: minio
  name: minio
  namespace: velero
  resourceVersion: "1799322"
  selfLink: /api/v1/namespaces/velero/services/minio
  uid: 59c59e95-4001-11e9-a7c8-005056a27540
spec:
  clusterIP: 10.100.200.169
  externalTrafficPolicy: Cluster
  ports:
  - nodePort: 30971
    port: 9000
    protocol: TCP
    targetPort: 9000
  selector:
    component: minio
  sessionAffinity: None
  type: NodePort   
status:
  loadBalancer: {}

After making the change, you can now see which port Minio is available on externally.

cormac@pks-cli:~/Velero$ kubectl get svc -n velero
NAME    TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
minio   NodePort   10.100.200.169   <none>        9000:32166/TCP   59m
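If you would rather not open an editor, the same change can be sketched non-interactively with kubectl patch. The function name below is my own; the service name and namespace are the ones used above.

```shell
# Sketch: switch the Minio service to NodePort without opening an editor,
# then print the node port that Kubernetes allocated.
expose_minio() {
  kubectl patch svc minio -n velero -p '{"spec":{"type":"NodePort"}}' >/dev/null &&
  kubectl get svc minio -n velero -o jsonpath='{.spec.ports[0].nodePort}'
}

# Example:
#   expose_minio
```

The port is picked by Kubernetes from the NodePort range, so expect a different value on each cluster.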

Now the Minio S3 object store is accessible externally. Point a browser at the IP address of a Kubernetes worker node, append the port shown above (in this case 32166), and you should get access to the Minio object store interface. Next we need to change one of the configuration files for Velero/Minio so that the velero client can reach the store for displaying logs, etc. This file is config/minio/05-backupstoragelocation.yaml. Edit this file, uncomment the last line for the publicUrl, and set it appropriately.

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero
  config:
    region: minio
    s3ForcePathStyle: "true"
    s3Url: http://minio.velero.svc:9000
    # Uncomment the following line and provide the value of an externally
    # available URL for downloading logs, running Velero describe, and more.
    publicUrl: http://10.27.51.187:32166

Reapply the configuration using the command kubectl apply -f config/minio/05-backupstoragelocation.yaml. Now you should be able to access the logs using commands such as velero backup logs backup-name.
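The publicUrl value is simply a worker node address plus the node port of the Minio service. A minimal sketch for deriving it is shown here; it assumes the first node in the list reports an InternalIP that is reachable from wherever the velero client runs, which may not hold in every environment.

```shell
# Sketch: print a candidate publicUrl from the first node's InternalIP
# and the node port of the Minio service.
minio_public_url() {
  node_ip="$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')"
  node_port="$(kubectl get svc minio -n velero -o jsonpath='{.spec.ports[0].nodePort}')"
  echo "http://${node_ip}:${node_port}"
}

# Example:
#   minio_public_url
```

Any worker node address should do, since a NodePort service listens on every node.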

3. PKS interop – Privileged Containers

This was another issue that stumped me for a while. With PKS, through Pivotal Operations Manager, you create plans for the types of K8s clusters that you deploy. One of the options in a plan is to ‘Enable Privileged Containers’. Since this always stated ‘use with caution’, I always left it disabled. However, with Velero, and specifically with the restic portion (snapshots), I hit a problem when this was disabled. On trying to create a backup, I encountered these events on the restic daemon set:

cormac@pks-cli:~/Velero/config/minio$ kubectl describe ds restic -n velero
Name:           restic
Selector:       name=restic
.
.
Events:
  Type     Reason        Age   From                  Message
  ----     ------        ----  ----                  -------
  Warning  FailedCreate  49m   daemonset-controller  Error creating: pods "restic-v567s" is forbidden: pod.Spec.SecurityContext.RunAsUser is forbidden
  Warning  FailedCreate  32m   daemonset-controller  Error creating: pods "restic-xv5v7" is forbidden: pod.Spec.SecurityContext.RunAsUser is forbidden
  Warning  FailedCreate  15m   daemonset-controller  Error creating: pods "restic-vx2cb" is forbidden: pod.Spec.SecurityContext.RunAsUser is forbidden

So I went back to the Pivotal Ops Manager, went to the PKS tile, edited the plan, and enabled the checkbox for both ‘Enable Privileged Containers’ and ‘Disable DenyEscalatingExec’. I had to re-apply the PKS configuration, but once this completed (and I didn’t have to do anything with the K8s cluster btw, it is all taken care of by PKS), the restic pods were created successfully. Here are the buttons as they appear in the PKS tile > Plan.
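To confirm that the restic pods really did come up after the plan change, a small check along these lines can help. It is only a sketch: it reads the standard DaemonSet status fields and treats the daemon set as healthy when every scheduled pod is ready.

```shell
# Sketch: report how many restic pods are ready and succeed only if all of them are.
restic_ready() {
  desired="$(kubectl get ds restic -n velero -o jsonpath='{.status.desiredNumberScheduled}')"
  ready="$(kubectl get ds restic -n velero -o jsonpath='{.status.numberReady}')"
  desired="${desired:-0}"
  ready="${ready:-0}"
  echo "restic pods ready: ${ready}/${desired}"
  [ "$desired" -gt 0 ] && [ "$ready" -eq "$desired" ]
}

# Example:
#   restic_ready || kubectl describe ds restic -n velero
```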

OK – we are now ready to take a backup.

4. First Backup (CouchBase StatefulSet)

For my first backup, I wanted to back up a StatefulSet, in my case a CouchBase deployment that had been scaled out to 3 replicas. This means 3 pods and 3 PVs. Velero uses annotations to identify components for backup. Here I added a special annotation to the pods, naming the volume in each pod that should be picked up by the restic utility for snapshotting. Note that this application is in its own namespace called “couchbase”.

cormac@pks-cli:~/Velero$ kubectl -n couchbase annotate pod/couchbase-0 backup.velero.io/backup-volumes=couchbase-data
pod/couchbase-0 annotated
cormac@pks-cli:~/Velero$ kubectl -n couchbase annotate pod/couchbase-1 backup.velero.io/backup-volumes=couchbase-data
pod/couchbase-1 annotated
cormac@pks-cli:~/Velero$ kubectl -n couchbase annotate pod/couchbase-2 backup.velero.io/backup-volumes=couchbase-data
pod/couchbase-2 annotated
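With more replicas, typing the annotate command per pod gets tedious. The loop below is a minimal sketch of the same thing, relying on the fact that StatefulSet pods are numbered from 0; the function name is my own, while the namespace, pod prefix and volume name are the ones from my deployment.

```shell
# Sketch: annotate pods couchbase-0 .. couchbase-(N-1) for restic backup.
# $1 is the number of replicas in the StatefulSet.
annotate_couchbase_pods() {
  replicas="$1"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    kubectl -n couchbase annotate "pod/couchbase-$i" backup.velero.io/backup-volumes=couchbase-data
    i=$((i + 1))
  done
}

# Example:
#   annotate_couchbase_pods 3
```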

Now I can run the backup:

cormac@pks-cli:~/Velero$ velero backup create couchbase
Backup request "couchbase" submitted successfully.
Run `velero backup describe couchbase` or `velero backup logs couchbase` for more details.

cormac@pks-cli:~/Velero$ velero backup describe couchbase --details
Name:         couchbase
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  <none>
Phase:  InProgress
Namespaces:
  Included:  *
  Excluded:  <none>
Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto
Label selector:  <none>
Storage Location:  default
Snapshot PVs:  auto
TTL:  720h0m0s
Hooks:  <none>
Backup Format Version:  1
Started:    <n/a>
Completed:  <n/a>
Expiration:  2019-04-05 13:25:25 +0100 IST
Validation errors:  <none>
Persistent Volumes: <none included>
Restic Backups:
  New:
    couchbase/couchbase-0: couchbase-data

cormac@pks-cli:~/Velero/$ velero backup get 
NAME        STATUS      CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR 
couchbase   Completed   2019-03-06 13:24:27 +0000 GMT   29d       default            <none>
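Note that the backup above includes every namespace in the cluster. If you only care about one application, a sketch like the following restricts the backup to a single namespace and blocks until it finishes. It assumes your Velero release supports the --include-namespaces and --wait flags on backup create; the backup name in the example is hypothetical.

```shell
# Sketch: back up a single namespace and wait for the backup to finish.
# $1 is the backup name, $2 the namespace to include.
backup_namespace() {
  name="$1"
  namespace="$2"
  velero backup create "$name" --include-namespaces "$namespace" --wait
}

# Example (hypothetical backup name):
#   backup_namespace cb-only couchbase
```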

At this point, it is also interesting to log in to the Minio portal and see what has been backed up. Here is an overview from the top-most level: velero/restic/couchbase.

5. First Restore (CouchBase StatefulSet)

To do a real restore, let’s delete the couchbase namespace which has all of my application and data.

cormac@pks-cli:~/Velero/tests$ kubectl delete ns couchbase
namespace "couchbase" deleted

cormac@pks-cli:~/Velero/tests$ kubectl get ns
NAME          STATUS   AGE
default       Active   15d
kube-public   Active   15d
kube-system   Active   15d
pks-system    Active   15d
velero        Active   16h

Let’s now try to restore the deleted namespace and all of its contents (pods, PVs, etc). Some of the more observant among you will notice that the backup name I am using is “cb2”, and not “couchbase”. That is only because I did a few different backup tests, and cb2 is the one I am restoring. You will specify your own backup name here.

cormac@pks-cli:~/Velero/tests$ velero restore create --from-backup cb2
Restore request "cb2-20190307090808" submitted successfully.
Run `velero restore describe cb2-20190307090808` or `velero restore logs cb2-20190307090808` for more details.

cormac@pks-cli:~/Velero/tests$ velero restore describe cb2-20190307090808
Name:         cb2-20190307090808
Namespace:    velero
Labels:       <none>
Annotations:  <none>
Backup:  cb2
Namespaces:
  Included:  *
  Excluded:  <none>
Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.ark.heptio.com, backups.velero.io, restores.ark.heptio.com, restores.velero.io
  Cluster-scoped:  auto
Namespace mappings:  <none>
Label selector:  <none>
Restore PVs:  auto
Phase:  InProgress
Validation errors:  <none>
Warnings:  <none>
Errors:    <none>
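Rather than re-running the describe command by hand until the phase changes, the restore object can be polled. This is only a sketch: it assumes the restore is visible as a restores.velero.io resource in the velero namespace with a status.phase field, and it will keep looping while the phase is anything other than one of the terminal values listed.

```shell
# Sketch: poll a Velero restore until it reaches a terminal phase.
# $1 is the restore name, $2 the polling interval in seconds (default 5).
wait_for_restore() {
  restore="$1"
  interval="${2:-5}"
  while :; do
    phase="$(kubectl get restores.velero.io "$restore" -n velero -o jsonpath='{.status.phase}')"
    case "$phase" in
      Completed)
        echo "restore $restore completed"
        return 0 ;;
      Failed|FailedValidation|PartiallyFailed)
        echo "restore $restore ended in phase $phase"
        return 1 ;;
    esac
    sleep "$interval"
  done
}

# Example:
#   wait_for_restore cb2-20190307090808
```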

After a few moments, the restore completes. I can also see that all 3 of my PVs are referenced. The next step is to verify that the application has indeed recovered and is usable.

cormac@pks-cli:~/Velero/tests$ velero restore describe cb2-20190307090808 --details
Name:         cb2-20190307090808
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Backup:  cb2

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.ark.heptio.com, backups.velero.io, restores.ark.heptio.com, restores.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Phase:  Completed

Validation errors:  <none>

Warnings:  <none>
Errors:    <none>

Restic Restores:
  Completed:
    couchbase/couchbase-0: couchbase-data
    couchbase/couchbase-1: couchbase-data
    couchbase/couchbase-2: couchbase-data

Everything appears to have come back successfully.

cormac@pks-cli:~/Velero/tests$ kubectl get ns
NAME          STATUS   AGE
couchbase     Active   70s
default       Active   15d
kube-public   Active   15d
kube-system   Active   15d
pks-system    Active   15d
velero        Active   16h

cormac@pks-cli:~/Velero/tests$ kubectl get pods -n couchbase
NAME          READY   STATUS    RESTARTS   AGE
couchbase-0   1/1     Running   0          90s
couchbase-1   1/1     Running   0          90s
couchbase-2   1/1     Running   0          90s

cormac@pks-cli:~/Velero/tests$ kubectl get pv -n couchbase
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS   REASON   AGE
pvc-8d676074-4032-11e9-842d-005056a27540   10Gi       RWO            Delete           Bound    velero/minio-pv-claim-1                minio-sc                16h
pvc-9d9c537b-40b8-11e9-842d-005056a27540   1Gi        RWO            Delete           Bound    couchbase/couchbase-data-couchbase-0   couchbasesc             95s
pvc-9d9d7846-40b8-11e9-842d-005056a27540   1Gi        RWO            Delete           Bound    couchbase/couchbase-data-couchbase-1   couchbasesc             95s
pvc-9d9f60b9-40b8-11e9-842d-005056a27540   1Gi        RWO            Delete           Bound    couchbase/couchbase-data-couchbase-2   couchbasesc             91s

cormac@pks-cli:~/Velero/tests$ kubectl get svc -n couchbase
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
couchbase      ClusterIP      None            <none>        8091/TCP         117s
couchbase-ui   NodePort      10.100.200.80   <pending>     8091:31365/TCP    117s

I was also able to successfully connect to my CouchBase UI and examine the contents, and all looked good to me. To close, I’d like to give a big shout-out to my colleague Myles Gray, who fielded a lot of my K8s related questions. Cheers Myles!

[Update] OK, so I am not sure why it worked for me when I did my initial blog, but further attempts to restore this configuration resulted in CouchBase not starting. The reason for this seems to be that the IP addresses allocated to the CouchBase pods are hard-coded in either /opt/couchbase/var/lib/couchbase/ip or /opt/couchbase/var/lib/couchbase/ip_start on the pods. To get CouchBase back up and running, I had to replace the old entries (which were the IP addresses of the nodes which were backed up) with the new IP addresses of the restored pods. I ssh’ed onto the pods to do this. You can get the new IP address from /etc/hosts on each pod. Once the new entries were successfully updated, CouchBase restarted. However, it did not retain the original configuration, and seems to have reset to a default deployment. I’ll continue to research what else may need to be changed to bring the original config back.
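A quick way to spot this condition is to compare the address stored in the file with the address the pod currently has. The sketch below does only that and changes nothing; it assumes the file contains the IP address somewhere in its text and that cat is available in the container. The match is a plain substring check, so treat a positive result as a hint rather than proof.

```shell
# Sketch: succeed if the stored CouchBase address mentions the pod's current IP.
# $1 is the pod name, $2 an optional path to the file to inspect.
couchbase_ip_is_current() {
  pod="$1"
  ip_file="${2:-/opt/couchbase/var/lib/couchbase/ip}"
  pod_ip="$(kubectl get pod "$pod" -n couchbase -o jsonpath='{.status.podIP}')"
  stored="$(kubectl exec "$pod" -n couchbase -- cat "$ip_file")"
  echo "$pod: pod IP $pod_ip, stored entry: $stored"
  [ -n "$pod_ip" ] || return 1
  case "$stored" in
    *"$pod_ip"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   couchbase_ip_is_current couchbase-0 || echo "stale address on couchbase-0"
```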