Kubernetes Storage on vSphere 101 – StatefulSet

In my last post we looked at creating a highly available application that used multiple Pods in Kubernetes with Deployments and ReplicaSets. However, that post focused only on Pods. In this post, we will look at another way of creating highly available applications, through the use of StatefulSets. The first question you will probably have is: what is the difference between a Deployment (with ReplicaSets) and a StatefulSet? At a high level, the major difference is that a Deployment maintains the desired number of Pods for an application, whereas a StatefulSet maintains the desired number of Pods as well as their storage, in the form of persistent volumes (PVs). I’m obviously simplifying for the purposes of this 101 discussion. There are some other differences which we will get to later.

The next question is when would you use one over the other? Well, let’s say you had a stateless application that did not need external storage, or indeed an application where all Pods wrote to the same ReadWriteMany shared external storage, such as an NFS file share. In these cases you would not need to manage any volumes on behalf of the application, since all Pods access the same storage. You would only need to manage the Pods, using an object that ensures the desired number of Pods are running. For such an application, as we saw in the previous post, you could use a Deployment object with ReplicaSets. This will try to ensure that the correct number of Pods desired by the application are available.

Now, if a distributed application has built-in replication features, for example a NoSQL database like Cassandra, each Pod would probably require its own storage. With such an application, as you scaled out the Pods, you would also want to scale out the storage. This is achieved by instantiating a new and unique persistent volume (PV) for each Pod. Since these applications have their own built-in replication to make them highly available and survive outages, should a Pod go down (impacting part of the application), the remaining Pods continue to run it using their own full copies of the replicated data, so the application can remain online and available. The StatefulSet will attempt to maintain the correct number of replicas (in this case Pods + PVs) to ensure that the application can self-heal. We will talk about failures and how storage handles such issues in another post. Suffice to say that we can simplify the difference between Deployments+ReplicaSets and StatefulSets by stating that a Deployment+ReplicaSet is used for maintaining a desired number of Pods, whereas a StatefulSet can be used for maintaining a desired number of both Pods and PVs.

So how does a StatefulSet create PVs and PVCs on the fly? It does this through the use of a volumeClaimTemplates entry in its manifest YAML file. This is where you add the reference to the StorageClass and the specification of the volume you wish to create. The volume is then included as a volumeMount for a container within the Pod. On applying the manifest YAML for the StatefulSet, you should observe the Pods and PVCs getting created and named with an incrementing numeric suffix. Obviously, the StorageClass that is referenced by the StatefulSet will need to exist for the PVC creation to work.

In the upcoming example, I will deploy a 3 node Cassandra DB as a StatefulSet. One thing that needs to exist for this application to work is a Service that will allow the different nodes to communicate with each other. In this example, I am using a headless Service. This is created by setting clusterIP to None, and it allows each of the Pods to be reached using a DNS name. Services are beyond the scope of this discussion, but suffice to say that this is necessary to allow the Cassandra nodes to form their own cluster and replicate their data.

Let’s start the demo by creating the StorageClass. Here is the manifest I am using.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: cass-sc
provisioner: kubernetes.io/vsphere-volume
parameters:
  storagePolicyName: raid-1
  datastore: vsanDatastore

There should be nothing very new here. Of note are the parameters, where we are specifying a storage policy of “raid-1”, which means that any Persistent Volumes created using this StorageClass will be instantiated as RAID-1 mirrored virtual machine disks (VMDKs) on my vSAN datastore. Have a look back at the 101 StorageClass post if you need a refresher.
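
As an aside, storagePolicyName is not the only parameter the kubernetes.io/vsphere-volume provisioner understands. If you did not have an SPBM policy to reference, you could specify a disk format instead. A minimal sketch follows; the class name thin-sc and the diskformat value are purely illustrative and not used anywhere in this demo.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: thin-sc
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin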

The next thing we should create is the headless Service, so that the nodes in the Cassandra application can communicate. Here is the manifest for this very simple headless Service. I have named the Service “cassandra”.

apiVersion: v1
kind: Service
metadata:
  labels:
    app: cassandra
  name: cassandra
  namespace: cassandra
spec:
  clusterIP: None
  selector:
    app: cassandra

I am going to deploy this app in its own K8s namespace, called cassandra. Thus, in the Service and later on in the StatefulSet, there is a metadata.namespace entry pointing to that namespace. Let’s create that new namespace, and deploy both the StorageClass and Service before we start taking a look at the StatefulSet for my Cassandra application.

$ kubectl create ns cassandra
namespace/cassandra created

$ kubectl get ns
NAME          STATUS   AGE
cassandra     Active   7s
default       Active   4d23h
kube-public   Active   4d23h
kube-system   Active   4d23h
pks-system    Active   4d23h

$ kubectl create -f cassandra-sc.yaml
storageclass.storage.k8s.io/cass-sc created

$ kubectl get sc
NAME      PROVISIONER                    AGE
cass-sc   kubernetes.io/vsphere-volume   8s

$ kubectl create -f headless-cassandra-service.yaml
service/cassandra created

$ kubectl get svc
NAME           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
cassandra     ClusterIP   None         <none>        <none>    6

OK – that’s StorageClass and headless Service taken care of. Now onto the main event, the Cassandra StatefulSet. This is the most complex YAML file that we have looked at so far. The reason for it being so complex is that it includes quite a bit of detail around resources and environmental settings for the Cassandra application. Therefore, I am going to chunk it up a bit and just review it in two parts. Let’s take a look at some entries that we should already be somewhat familiar with. We will fill in the blanks later on.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
  namespace: cassandra
  labels:
    app: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: gcr.io/google-samples/cassandra:v11
        ports:
        - containerPort: 7000
          name: intra-node
        - containerPort: 7001
          name: tls-intra-node
        - containerPort: 7199
          name: jmx
        - containerPort: 9042
          name: cql
.
<snip>
.
        volumeMounts:
        - name: cassandra-data
          mountPath: /cassandra_data
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
      annotations:
        volume.beta.kubernetes.io/storage-class: cass-sc
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

OK, let’s talk about the above. It is a StatefulSet, and in spec.serviceName we specify our headless Service created earlier. We have asked for 3 replicas via spec.replicas, and in this case, this will instantiate 3 x Pods and 3 x PVCs using the StorageClass specified in the volumeClaimTemplates metadata annotation volume.beta.kubernetes.io/storage-class. The volumes will be ReadWriteOnce, and 1Gi in size. These will then be mounted onto /cassandra_data in each container, as per the volumeMounts section of the container spec, where the mount name cassandra-data matches the name of the volume claim template. I am pulling the v11 Cassandra image as that has cqlsh built in, which can be used for creating tables, etc. Feel free to use later versions if you wish.
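
One small aside: the volume.beta.kubernetes.io/storage-class annotation is the older way of selecting a StorageClass. On more recent Kubernetes releases, the storageClassName field in the claim spec does the same job, so an equivalent volumeClaimTemplates section could be written roughly as follows (a sketch, not the manifest used in this demo):

  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      storageClassName: cass-sc
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi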

Now, let’s take a look at the application specific stuff, which is what I snipped out of the above manifest. Remember this is Cassandra specific, so don’t worry too much about it; it is not necessary to understand these details in order to understand the concept of a StatefulSet. This block appears immediately after the ports section above, and just before the volume mounts. Remember to keep the indentation (leading spaces) on each of the entries, or else the YAML file won’t be parsed correctly.

        resources:
          limits:
            cpu: "500m"
            memory: 1Gi
          requests:
            cpu: "500m"
            memory: 1Gi
        securityContext:
          capabilities:
            add:
              - IPC_LOCK
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - nodetool drain
        env:
          - name: MAX_HEAP_SIZE
            value: 512M
          - name: HEAP_NEWSIZE
            value: 100M
          - name: CASSANDRA_SEEDS
            value: "cassandra-0.cassandra.cassandra.svc.cluster.local"
          - name: CASSANDRA_CLUSTER_NAME
            value: "K8Demo"
          - name: CASSANDRA_DC
            value: "DC1-K8Demo"
          - name: CASSANDRA_RACK
            value: "Rack1-K8Demo"
          - name: POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - /ready-probe.sh
          initialDelaySeconds: 15
          timeoutSeconds: 5

The resources section should be fairly self explanatory; here we set matching CPU and memory requests and limits for each Pod. The lifecycle section states that a Cassandra CLI tool called nodetool should be invoked to drain a node when it is stopped. There are then a bunch of environment variables passed in the env section. The important one here is CASSANDRA_SEEDS, which is the DNS name of the first node. It reflects the first host name (cassandra-0), the service name (cassandra) and the namespace name (again, cassandra). Other nodes will connect to this first node to form a cluster, so if you have different service or namespace names, this variable will need to be modified or the hosts won’t be able to join the cluster. Finally, there is a readinessProbe which runs a script to check that everything is working.
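
To see where that seed value comes from: Pods in a StatefulSet backed by a headless Service get stable DNS names of the form <pod-name>.<service-name>.<namespace>.svc.cluster.local, which for the seed node in this demo gives cassandra-0.cassandra.cassandra.svc.cluster.local. If you want to sanity check the name once the Pods are up, something along these lines should resolve it, assuming a DNS lookup tool such as nslookup is available in the container image (an assumption on my part, not something verified in this demo):

$ kubectl exec -it cassandra-1 -n cassandra nslookup cassandra-0.cassandra.cassandra.svc.cluster.local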

If we put this YAML file together and deploy it, we should see a StatefulSet get rolled out which contains 3 Pods; each Pod will have its own clearly identifiable PVC, and each PVC will dynamically request a PV to be created. Each PV will be a VMDK on the vSphere vSAN datastore, and will have a “raid-1” policy, meaning the PVs will be mirrored on the vSAN datastore. This comes from the StorageClass, as seen earlier. Let’s give it a go, keeping in mind that we are now working in the cassandra namespace, so each kubectl command should reference that namespace. Some objects are cluster-scoped, such as PVs and StorageClasses, so these do not need to be queried by namespace. We will start by showing that there are no Pods, PVCs or PVs, and then deploy the StatefulSet.
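
One optional convenience before we run through the output: rather than adding -n cassandra to every command, you could set the namespace on your current kubectl context. I will keep using -n explicitly below, so this is purely an aside:

$ kubectl config set-context --current --namespace=cassandra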

$ kubectl get sc
NAME      PROVISIONER                    AGE
cass-sc   kubernetes.io/vsphere-volume   72m

$ kubectl get svc -n cassandra
NAME        TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
cassandra   ClusterIP   None         <none>        <none>    35s

$ kubectl get pods -n cassandra
No resources found.

$ kubectl get pvc -n cassandra
No resources found.

$ kubectl get pv
No resources found.

$ kubectl get sts -n cassandra
No resources found.

$ kubectl create -f cassandra-statefulset-orig.yaml
statefulset.apps/cassandra created

Let’s now monitor the creation of the various Pods, PVCs, and PVs as the StatefulSet comes online.

$ kubectl get sts -n cassandra
NAME        DESIRED   CURRENT   AGE
cassandra   3         1         26s

$ kubectl get pods -n cassandra
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   0/1     Running   0          35s

$ kubectl get pvc -n cassandra
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cassandra-data-cassandra-0   Bound    pvc-1b87e0e6-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        42s

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS   REASON   AGE
pvc-1b87e0e6-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-0   cass-sc                 33

And after a few minutes in my environment, the full StatefulSet is online with all of the necessary Pods, PVCs and PVs, the latter of which have been instantiated on the fly as the StatefulSet requires.

$ kubectl get sts -n cassandra
NAME        DESIRED   CURRENT   AGE
cassandra   3         3         3m44s

$ kubectl get pods -n cassandra
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   1/1     Running   0          13m
cassandra-1   1/1     Running   0          12m
cassandra-2   1/1     Running   0          10m

$ kubectl get pvc -n cassandra
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cassandra-data-cassandra-0   Bound    pvc-1b87e0e6-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        3m51s
cassandra-data-cassandra-1   Bound    pvc-3defb27e-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        2m53s
cassandra-data-cassandra-2   Bound    pvc-811c8141-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        60s

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS   REASON   AGE
pvc-1b87e0e6-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-0   cass-sc                 3m44s
pvc-3defb27e-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-1   cass-sc                 2m44s
pvc-811c8141-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-2   cass-sc                 61s

We can also check the application itself, using the nodetool CLI utility mentioned previously. You can see how some of those environment variables passed in from the YAML manifest have been utilized, e.g. Datacenter, Rack.

$ kubectl exec -it cassandra-0 -n cassandra nodetool status
Datacenter: DC1-K8Demo
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.200.40.8   76.01 KiB  32           58.4%             3d4106bb-d716-4008-b5fc-f89c1af9bed9  Rack1-K8Demo
UN  10.200.41.10  95.05 KiB  32           73.5%             94217c7c-310a-4ab6-8a09-2369d56a8691  Rack1-K8Demo
UN  10.200.16.38  104.4 KiB  32           68.1%             afa03459-bfb9-4399-b35f-a0cd57ca4ebf  Rack1-K8Demo

Now, it should be quite obvious how different the StatefulSet is from the Deployment that we saw in an earlier post. The Pods are named in a consistent fashion, as are the PVCs (which use a combination of the Pod and volume claim template names). We can tell which Pod was started first (cassandra-0), which is important for an application like Cassandra, as it means you can tell all the other Pods where they should join when they start up (remember the CASSANDRA_SEEDS setting from earlier). And unlike Deployments, a StatefulSet will maintain the correct number of Pods and PVs. Let’s verify that by doing a scale-out test on the StatefulSet.

$ kubectl get pods -n cassandra
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   1/1     Running   0          21m
cassandra-1   1/1     Running   0          20m
cassandra-2   1/1     Running   0          18m

$ kubectl scale sts cassandra --replicas=4 -n cassandra
statefulset.apps/cassandra scaled

$ kubectl get pods -n cassandra
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   1/1     Running   0          22m
cassandra-1   1/1     Running   0          21m
cassandra-2   1/1     Running   0          19m
cassandra-3   0/1     Pending   0          3s

$ kubectl get pvc -n cassandra
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cassandra-data-cassandra-0   Bound    pvc-1b87e0e6-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        22m
cassandra-data-cassandra-1   Bound    pvc-3defb27e-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        21m
cassandra-data-cassandra-2   Bound    pvc-811c8141-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        19m
cassandra-data-cassandra-3   Bound    pvc-2e09ab44-879b-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        13s

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS   REASON   AGE
pvc-1b87e0e6-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-0   cass-sc                 22m
pvc-2e09ab44-879b-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-3   cass-sc                 10s
pvc-3defb27e-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-1   cass-sc                 21m
pvc-811c8141-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-2   cass-sc                 19m

$ kubectl get pods -n cassandra
NAME          READY   STATUS              RESTARTS   AGE
cassandra-0   1/1     Running             0          22m
cassandra-1   1/1     Running             0          21m
cassandra-2   1/1     Running             0          19m
cassandra-3   0/1     ContainerCreating   0          23s

$ kubectl get pods -n cassandra
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   1/1     Running   0          24m
cassandra-1   1/1     Running   0          23m
cassandra-2   1/1     Running   0          21m
cassandra-3   1/1     Running   0          2m17s

$ kubectl get sts -n cassandra
NAME        DESIRED   CURRENT   AGE
cassandra   4         4         26m

Notice also how the naming conventions are maintained, both for the Pods and the PVCs. If we scale our Cassandra deployment back down from 4 Pods to 3, 2 or even 1, the Pods that get removed are the ones with the highest numbers. Pod 0 is the first Pod created, and it is also the last one to be removed – you can see this in the scaling demo next. The relationships between Pods and PVs are also easy to identify. Another thing to note is that even when we scale back the application, the PVCs and PVs are not removed. This is by design, to protect your data. So if you do want to remove Persistent Volumes that are no longer used by Pods, this will have to be done manually. Fortunately, due to the naming convention, it is easy to identify which PVCs (and thus PVs) to remove.

$ kubectl scale sts cassandra --replicas=2 -n cassandra
statefulset.apps/cassandra scaled

$ kubectl get sts -n cassandra
NAME        DESIRED   CURRENT   AGE
cassandra   2         3         31m

$ kubectl get pods -n cassandra
NAME          READY   STATUS        RESTARTS   AGE
cassandra-0   1/1     Running       0          31m
cassandra-1   1/1     Running       0          30m
cassandra-2   1/1     Terminating   0          28m

$ kubectl get pods -n cassandra
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   1/1     Running   0          31m
cassandra-1   1/1     Running   0          30m

$ kubectl get sts -n cassandra
NAME        DESIRED   CURRENT   AGE
cassandra   2         2         31m

$ kubectl get pvc -n cassandra
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cassandra-data-cassandra-0   Bound    pvc-1b87e0e6-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        31m
cassandra-data-cassandra-1   Bound    pvc-3defb27e-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        30m
cassandra-data-cassandra-2   Bound    pvc-811c8141-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        28m
cassandra-data-cassandra-3   Bound    pvc-2e09ab44-879b-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        9m49s

$ kubectl get pv -n cassandra
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS   REASON   AGE
pvc-1b87e0e6-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-0   cass-sc                 31m
pvc-2e09ab44-879b-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-3   cass-sc                 9m50s
pvc-3defb27e-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-1   cass-sc                 30m
pvc-811c8141-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-2   cass-sc                 29m

$ kubectl delete pvc cassandra-data-cassandra-2 cassandra-data-cassandra-3 -n cassandra
persistentvolumeclaim "cassandra-data-cassandra-2" deleted
persistentvolumeclaim "cassandra-data-cassandra-3" deleted

$ kubectl get pvc -n cassandra
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cassandra-data-cassandra-0   Bound    pvc-1b87e0e6-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        33m
cassandra-data-cassandra-1   Bound    pvc-3defb27e-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        32m

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS   REASON   AGE
pvc-1b87e0e6-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-0   cass-sc                 32m
pvc-3defb27e-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-1   cass-sc                 31m

OK – we have successfully scaled the application up and back down, and we can see by the numbering of the Pods that it has worked as expected. Note that you will also need to take care of the Cassandra application itself at this point. It will report that the Cassandra hosts that were running on those Pods are now marked as “DN” – down. You’ll have to do some cleanup with the “nodetool removenode” command to make Cassandra healthy once again; you will need to do this if you wish to scale the StatefulSet once more.
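
For reference, that cleanup might look something like the sketch below: use nodetool status to find the Host ID of each node reported as “DN”, then remove it with nodetool removenode. The <host-id-of-down-node> placeholder is mine; substitute the IDs from your own output.

$ kubectl exec -it cassandra-0 -n cassandra nodetool status
$ kubectl exec -it cassandra-0 -n cassandra nodetool removenode <host-id-of-down-node>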

As a final step, let’s see how ‘failures’ are handled by StatefulSets. At this point, I have cleaned up Cassandra after the scale tests, and have scaled back out to 3 hosts. Let’s check our application first.

$ kubectl get sts -n cassandra
NAME        DESIRED   CURRENT   AGE
cassandra   3         3         19h

$ kubectl get pods -n cassandra
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   1/1     Running   0          19h
cassandra-1   1/1     Running   0          19h
cassandra-2   1/1     Running   0          18h

$ kubectl get pvc -n cassandra
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cassandra-data-cassandra-0   Bound    pvc-1b87e0e6-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        19h
cassandra-data-cassandra-1   Bound    pvc-3defb27e-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        19h
cassandra-data-cassandra-2   Bound    pvc-890fbbbc-879d-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        18h

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS   REASON   AGE
pvc-1b87e0e6-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-0   cass-sc                 19h
pvc-3defb27e-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-1   cass-sc                 19h
pvc-890fbbbc-879d-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-2   cass-sc                 18

Let’s do the first test, and delete a Pod.

$ kubectl delete pod cassandra-0 -n cassandra
pod "cassandra-0" deleted

$ kubectl get pods -n cassandra
NAME          READY   STATUS              RESTARTS   AGE
cassandra-0   0/1     ContainerCreating   0          5s
cassandra-1   1/1     Running             0          19h
cassandra-2   1/1     Running             0          18h

$ kubectl get pods -n cassandra
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   0/1     Running   0          19s
cassandra-1   1/1     Running   0          19h
cassandra-2   1/1     Running   0          18h

$ kubectl get pods -n cassandra
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   1/1     Running   0          51s
cassandra-1   1/1     Running   0          19h
cassandra-2   1/1     Running   0          18h

The Pod was recreated and started after deletion, as we would expect. Since the PVC/PV were not removed, the newly created Pod mounted the existing PV, which it can easily identify by the PVC. OK – let’s try to do the same thing with a PVC. What you should notice is that the delete command will not complete, and the PVC will be left with a status of “Terminating” indefinitely. This is because K8s knows that the PVC cassandra-data-cassandra-0 is being used by the Pod cassandra-0, so it will not remove it.

$ kubectl delete pvc cassandra-data-cassandra-0 -n cassandra
persistentvolumeclaim "cassandra-data-cassandra-0" deleted
<-- stays here indefinitely -->

$ kubectl get pvc -n cassandra
NAME                         STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cassandra-data-cassandra-0   Terminating   pvc-1b87e0e6-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        19h
cassandra-data-cassandra-1   Bound         pvc-3defb27e-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        19h
cassandra-data-cassandra-2   Bound         pvc-890fbbbc-879d-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        18h
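
Under the covers, this protection comes from a finalizer that Kubernetes places on in-use PVCs (kubernetes.io/pvc-protection on recent releases), which prevents the object from being removed while a Pod still references it. If you are curious, you should be able to see it with something like:

$ kubectl get pvc cassandra-data-cassandra-0 -n cassandra -o jsonpath='{.metadata.finalizers}'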

In order to remove the PVC, you will have to remove the Pod. Once the Pod is removed, there is no longer any dependency on the PVC, so it, and the associated PV, can now be removed. However, this leaves you with a bit of an issue. When the Pod is recreated, which it will be since we have asked for 3 replicas in the StatefulSet, it can no longer be scheduled since there is no longer a PVC for it to use – we just deleted it. You will see the Pod cassandra-0 in a Pending state, and the following events associated with it if you describe the Pod:

$ kubectl get pods -n cassandra
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   0/1     Pending   0          23s
cassandra-1   1/1     Running   0          19h
cassandra-2   1/1     Running   187        18h

$ kubectl describe pod cassandra-0 -n cassandra
.
<snip>
.
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  116s (x4 over 116s)  default-scheduler  persistentvolumeclaim "cassandra-data-cassandra-0" not found

So the obvious next question is how to fix this. The easiest way is to build a new PVC manifest. You can get a good idea of what the entries should be by running “kubectl get pvc cassandra-data-cassandra-1 -n cassandra -o json” against any of the other PVCs. In my demo, the PVC for cassandra-0, created in the cassandra namespace, would look something like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cassandra-data-cassandra-0
  namespace: cassandra
spec:
  storageClassName: cass-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

Then I simply create the missing PVC, and I should see a new PVC and corresponding PV created. This means that the Pod can now be scheduled, since the PVC is back in place, and my StatefulSet returns to full health.

$ kubectl create -f cassandra-data-cassandra-0-pvc.yaml
persistentvolumeclaim/cassandra-data-cassandra-0 created

$ kubectl get pvc -n cassandra
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cassandra-data-cassandra-0   Bound    pvc-6213822a-883b-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        13s
cassandra-data-cassandra-1   Bound    pvc-3defb27e-8798-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        19h
cassandra-data-cassandra-2   Bound    pvc-890fbbbc-879d-11e9-ac8b-005056a2c144   1Gi        RWO            cass-sc        18h

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS   REASON   AGE
pvc-3defb27e-8798-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-1   cass-sc                 19h
pvc-6213822a-883b-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-0   cass-sc                 13s
pvc-890fbbbc-879d-11e9-ac8b-005056a2c144   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-2   cass-sc                 18h

$ kubectl get pods -n cassandra
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   0/1     Pending   0          14m
cassandra-1   1/1     Running   0          19h
cassandra-2   1/1     Running   0          18h

$ kubectl get pods -n cassandra
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   0/1     Running   0          15m
cassandra-1   1/1     Running   0          19h
cassandra-2   1/1     Running   0          18h

$ kubectl get pods -n cassandra
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   1/1     Running   0          15m
cassandra-1   1/1     Running   0          19h
cassandra-2   1/1     Running   0          18h

$ kubectl get sts -n cassandra
NAME        DESIRED   CURRENT   AGE
cassandra   3         3         19h

The main thing to highlight about that last exercise is that a PVC cannot be removed while a Pod is using the claim. Similarly, you cannot delete a PV which is bound to a PVC. This should stop you from doing something silly to your application. And that just about does it for the 101 series. You should now have a decent understanding of PVs, PVCs, StorageClasses, Deployments and ReplicaSets, and now StatefulSets, when using K8s on vSphere storage. The next item I want to tackle is failure events, and what is supposed to happen when something fails. That will take a little more work and much testing to figure out. Check back soon.

Manifests used in this demo can be found on my vsphere-storage-101 github repo.