Kubernetes Storage on vSphere 101 - Failure Scenarios

We have looked at quite a few scenarios when Kubernetes is running on vSphere, and what that means for storage. We looked at PVs, PVC, PODs, Storage Classes, Deployments and ReplicaSets, and most recently we looked at StatefulSets. In a few of the posts we looked at some controlled failures, for example, when we deleted a Pod from a Deployment or from a StatefulSet. In this post, I wanted to look a bit closer at an uncontrolled failure, say when a node crashes. However, before getting into this in too much details, it is worth highlighting a few of the core components of Kubernetes before we get started.

Core Components

In a typical K8s deployment, the majority of the components that I will be describing appear in the namespace kube-system. Here is a recent deployment of K8s 1.14.3 on my vSphere environment. Let’s get a list of the pods first, so we can see what we are talking about.

$ kubectl get pods -n kube-system

NAME READY STATUS RESTARTS AGE

coredns-fb8b8dccf-7s4vg 1/1 Running 0 2d1h

coredns-fb8b8dccf-cwbxs 1/1 Running 0 2d1h

etcd-cor-k8s-m01 1/1 Running 0 2d1h

kube-apiserver-cor-k8s-m01 1/1 Running 0 2d1h

kube-controller-manager-cor-k8s-m01 1/1 Running 0 2d1h

kube-flannel-ds-amd64-hgrww 1/1 Running 0 2d1h

kube-flannel-ds-amd64-j9nck 1/1 Running 0 2d1h

kube-flannel-ds-amd64-n5d28 1/1 Running 0 2d1h

kube-proxy-2gqlj 1/1 Running 0 2d1h

kube-proxy-hc494 1/1 Running 0 2d1h

kube-proxy-hsvdz 1/1 Running 0 2d1h

kube-scheduler-cor-k8s-m01 1/1 Running 1 2d1h

Let’s begin with the Pods that run on the master, node cor-k8s-m01 in this example. To see where the Pods are running, you can use the kubectl get pods -o wide -n kube-system. The first component to highlight on the master is etcd, which is a key-value store that keeps track of all of the objects in your K8s cluster.

Next we have the kube-apiserver. In a nutshell, the API server takes care of all of the interactions that take place between core components and etcd. Other core components are constantly watching the API server for any changes related to the state of the system. If there is a change to an application definition, everyone watching gets a notification of the change.

This brings us to the kube-scheduler. The scheduler is responsible for finding the best node on which to run a newly created Pod.

Now we come to the kube-controller. This core component is responsible for running the control loops (reconcilers). Since Kubernetes is a Declarative system, the purpose of these control loops is to watch the actual state of the system, and if it is different from the desired/declared state, they initiates operations to rectify the situation and make the actual state the same as the desired state. An example of this could be something as simple as attaching a Persistent Volume to a Pod. When an application is deployed in K8s, the application definition (including the declarative state of its deployment) is persisted on the master server. The API server maintains both an in-memory cache of desired state (what the system desired state is) and another in-memory cache of the actual state (real world observed state). When these caches differ, the controller is responsible for initiating tasks to rectify the difference.

To close off on the master, CoreDNS, as its name might imply is the DNS service for assigning fully qualified domain names to virtual IP addresses assigned to services. You may also see this implemented as KubeDNS.

Let’s now talk about the workers or nodes. Kube-proxy is the component that configures node networking to route network requests from virtual IP addresses of a service to the endpoint implementing the service, anywhere in the cluster. Kube-proxy is implemented as a daemonset, means that an instance runs on every node. Also in this example, we are using flannel as the network overlay. This is also implemented as a daemon set, which means there is an instance on every node.

The main component on the nodes is the kubelet. This provides a number of functions. Some of its primary functions are to check with the API server to find which Pods should be running on this node, as well report the state of its running Pods back to the API server. One other interesting item to note is that the kubelet is responsible for running in-tree cloud providers, such as VMware’s vSphere Cloud Provider (VCP).

Let’s say we deploy a new application. The scheduler will notice a new Pod has not yet been scheduled. Therefore it runs its algorithm to find out the best node on which to run the Pod. The scheduler then updates the API server for the Pod to say where is should be scheduled, i.e. which node? The kubelet on that node is monitoring the API server, and now it sees that the new Pod has been scheduled on it, but this Pod is not running. The kubelet will now start the Pod (container). The kubelet continuously monitors the Pod to make sure it is running, and will restart it If it crashes. It also runs any reconciliation control loops to bring pods to the declared state if its actual state does not match. This is how it will remain until you delete it. When you delete the application, all of the components watching it will receive the deletion event. The kubelet will notice that Pod no longer exists on API server, so it will then go ahead and delete it from its node.

For a more detailed description of K8s internals, I strongly recommend reading this excellent blog on “Events, the DNA of Kubernetes” from my colleague, Michael Gasch.

Note that PKS (Pivotal Container Services) users will see a lot of differences when they display the Pods in the kube-system namespace. Many of the Pods discussed here are implemented as process on the PKS master. However, for the purposes of failure testing, the behaviour is the same.

With all of that background, lets now go ahead and examine what happens when a node fails, and what happens to the Pods running on that node. For the purposes of these exercises, I am using a combination of “power off” and the kubectl delete node command. The impact of these two operations on the behaviour of Kubernetes failure handling will be clearly seen shortly.

Standalone Pod behaviour on Node power off

First of all, lets talk about single, standalone Pod deployments. These are Pods that are not part of any Deployment or StatefulSet, so there are no objects ensuring their availability. I think there might be a misconception that a standalone Pod like this will be restarted automatically elsewhere in the K8 cluster when the node on which it resides is powered off. I know this is what I initially expected to happen before carrying out these tests. However, that is not the case. When the node on which the Pod is running is powered off, the node enters a NotReady state after approx 40 seconds of missed heartbeats (which are sent every 10 seconds). Around 5 minutes later, the Pod is marked for eviction/deletion and enters into a Terminating state, but remains in that state indefinitely. This is because the kubelet daemon on the node cannot be reached by the API server, so the decision to delete the Pod cannot be communicated. If the node is powered on again, the API server can again communicate to the kubelet, and it is at that point the decision to delete the Pod is communicated, and so the Pod is now deleted. Note however that the Pod is not restarted automatically (unless it is part of a Deployment or StatefulSet). The application has to be restarted manually. So yes, even though the node has recovered and has rejoined the cluster, any Pods on that node which were not protected, are deleted.

In previous versions of K8s (up to v1.5), these Pods used to be ‘force deleted’ even when the API server and kubelet could not talk. In later versions (1.5 and later) K8s no longer deletes the pods until it is confirmed that they have stopped running. To remove these Pods when a node is powered off or partitioned in some way, you may need to delete the node using the kubectl delete node command. This deletes all the Pods running on the node from the API server. You can read more about this behavior here.

Standalone Pod behaviour on Node delete

To remove a node from the K8s cluster, you can issue a kubectl delete node command (this is without powering off the node). In this case, the standalone Pods running on the node are immediately deleted and the node is removed from the cluster (with a big caveat that there are no Persistent Volumes attached to the Pod being evicted from the node). If the Pod has PVs attached, we will see what happens when we talk about StatefulSets shortly. If you are doing a controlled delete of a node, but wish to restart the Pods, you should evacuate the Pods first. K8s provides a kubectl drain command for such scenarios. This will restart the Pods on another node in the cluster, again assuming that the Pod is not consuming any PVs. If it is, we will need to take some additional steps to detach the PVs from the node that is being drained. Let’s look at those scenarios next.

StatefulSet behaviour on Node delete

In this section, we will look at a node delete impact on a StatefulSet, where each Pod has its own unique Persistent Volume (which on vSphere means that it is backed by a VMDK created using the VCP – vSphere Cloud Provider driver). For the purposes of this example, I will use a Cassandra application, but I will only deploy it with a single Replica. However, the StatefulSet will still be responsible for keeping that single Pod running. In this first example, I will delete the node on which the Pod is running using the kubectl delete node command, and observe the events. What we will see is that the API server can still communicate to the kubelet, and the Pod is deleted. Then, since it is a StatefulSet, we will see the Pod restarted on another node, and the PV will attempt to attach to the new node. However, this is not possible since the original node still has the PV attached. Once we remove the virtual machine disk (VMDK) backing the PV from the original node using the vSphere Client (or power off the original node), the reconciler for the attacherDetacher controller can detach the PV, and successfully re-attach it to the new node where the Pod has been scheduled. This is identical to the standalone Pod situation mentioned in the previous scenario – manual intervention is required.

Let’ look at the actual steps involved; we will start with a look at the current configuration. Its a StatefulSet with one Pod and one PVC/PV. We can also see which node the Pod is running on using the kubectl describe pod command.

cormac@pks-cli:~/cassandra$ kubectl get pods
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   1/1     Running   0          5m19s

cormac@pks-cli:~/cassandra$ kubectl describe pod cassandra-0 | grep Node:
Node:               a2cfcf41-922d-487b-bdf6-b453d7fb3105/10.27.51.187
cormac@pks-cli:~/cassandra$

cormac@pks-cli:~/cassandra$ kubectl get pvc
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cassandra-data-cassandra-0   Bound    pvc-91600a74-8d16-11e9-b070-005056a2a261   1Gi        RWO            cass-sc        5m31s

cormac@pks-cli:~/cassandra$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS   REASON   AGE
pvc-91600a74-8d16-11e9-b070-005056a2a261   1Gi        RWO            Delete           Bound    cassandra/cassandra-data-cassandra-0   cass-sc                 5m26s

cormac@pks-cli:~/cassandra$ kubectl describe pv pvc-91600a74-8d16-11e9-b070-005056a2a261
Name:            pvc-91600a74-8d16-11e9-b070-005056a2a261
Labels:          <none>
Annotations:     kubernetes.io/createdby: vsphere-volume-dynamic-provisioner
                 pv.kubernetes.io/bound-by-controller: yes
                 pv.kubernetes.io/provisioned-by: kubernetes.io/vsphere-volume
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    cass-sc
Status:          Bound
Claim:           cassandra/cassandra-data-cassandra-0
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        1Gi
Node Affinity:   <none>
Message:
Source:
    Type:               vSphereVolume (a Persistent Disk resource in vSphere)
    VolumePath:         [vsanDatastore] fef94d5b-fa8b-491f-bf0a-246e962f4850/kubernetes-dynamic-pvc-91600a74-8d16-11e9-b070-005056a2a261.vmdk
    FSType:             ext4
    StoragePolicyName:  raid-1
    VolumeID:
Events:                 <none>

Next, let’s take a look at the node VM in the vSphere Client UI, and we should observe that the attached VMDK matches the Persistent Volume.

Let’s also try to put something useful into the Cassandra DB, so we can check that it is still intact after our various testing.

$ kubectl exec -it cassandra-0 -n cassandra nodetool status
Datacenter: DC1-K8Demo
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.200.100.6  172.85 KiB  32           ?       2726920f-38d4-4c41-bb16-e28bc6a2a1fb  Rack1-K8Demo

$ kubectl exec -it cassandra-0 -n cassandra -- cqlsh
Connected to K8Demo at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.9 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh:demodb> CREATE KEYSPACE mydemo WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
cqlsh:demodb> use mydemo;
cqlsh:mydemo> CREATE TABLE emp(emp_id int PRIMARY KEY, emp_name text, emp_city text, emp_sal varint,emp_phone varint);
cqlsh:mydemo> INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES (100, 'Cormac', 'Cork', 999, 1000000);
cqlsh:mydemo> exit

Great. Now I will delete this node from the cluster using kubectl delete node. Within a very short space of time, I see the following events appearing for the Pod.

Normal    Started                  Pod    Started container
Warning   Unhealthy                Pod    Readiness probe failed:
Normal    Scheduled                Pod    Successfully assigned cassandra/cassandra-0 to a2cfcf41-922d-487b-bdf6-b453d7fb3105
Warning   FailedAttachVolume       Pod    Multi-Attach error for volume "pvc-91600a74-8d16-11e9-b070-005056a2a261" \
Volume is already exclusively attached to one node and can't be attached to another
Normal    Killing                  Pod    Killing container with id docker://cassandra:Need to kill Pod

The Pod is restarted on a new node, but the volume cannot be attached to the new node, since it is still attached to the original node. Also 2 minutes later, we also see a problem with the mount attempt, which obviously cannot happen since the volume is not attached:

Warning   FailedMount              Pod    Unable to mount volumes for pod "cassandra-0_cassandra(5a2dc01e-8d19-11e9-b070-005056a2a261)": \
timeout expired waiting for volumes to attach or mount for pod "cassandra"/"cassandra-0". list of unmounted volumes=[cassandra-data]. \
list of unattached volumes=[cassandra-data default-token-6fj8b]

Now, the reconciler for the attacherDetacher controller has a 6 minute timer, and once that timer has expired, it will attempt to forcibly detach the volume from the current (now deleted) node, and attach it to the new node. However, because the PV is still attached to the original node VM, even a force detach does not work.

Warning   FailedAttachVolume   Pod    AttachVolume.Attach failed for volume "pvc-91600a74-8d16-11e9-b070-005056a2a261" : \
Failed to add disk 'scsi1:0'.

And if we examine the tasks in the vSphere client UI, we observe the following errors, which match the events seen in the Pod:

OK – now we need some manual intervention to resolve the situation. To resolve this issue, I have a few choices. One option is to shutdown the node VM that I just deleted from the cluster, and still has the VMDK/PV attached. The other option is to simply remove the VMDK/PV manually from the original node VM. In this case, I will just remove the VMDK/PV from the VM via the vSphere Client UI, but make sure that the “Delete files from datastore” checkbox is not selected. If it is, then we would end up deleting the VDMK from disk, which is not what we want.

After a little time, we see the following in the logs, indicating that the PV could be successfully attached and the Pod can be started:

Normal    SuccessfulAttachVolume   Pod    AttachVolume.Attach succeeded for volume "pvc-91600a74-8d16-11e9-b070-005056a2a261"
Normal    Pulling                  Pod    pulling image "harbor.rainpole.com/library/cassandra:v11"
Normal    Pulled                   Pod    Successfully pulled image "harbor.rainpole.com/library/cassandra:v11"
Normal    Created                  Pod    Created container
Normal    Started                  Pod    Started container

Everything has returned to the normal, desired state of operation at this point. Let’s now try a different sequence.

StatefulSet behaviour on Node power off

In this next test, we are going to take a slightly different approach. In this test, we will first power off the node on which the Pod of our StatefulSet is running. Then we will see how K8s handles this, and if any further interaction is needed. Let’s check out Cassandra DB and contents before going any further.

$ kubectl exec -it cassandra-0 -n cassandra nodetool status
Datacenter: DC1-K8Demo
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.200.82.12  173.43 KiB  32           ?       2726920f-38d4-4c41-bb16-e28bc6a2a1fb  Rack1-K8Demo

$ kubectl exec -it cassandra-0 -n cassandra -- cqlsh
Connected to K8Demo at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.9 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh> select * from mydemo.emp;

emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+----------+----------+-----------+---------
    100 |     Cork |   Cormac |       999 | 1000000
(1 rows)
cqlsh>

All looks good. The next step is to determine where there Pod is running (on which node VM) and then power it down.

$ kubectl describe pod cassandra-0 | grep Node:
Node:               62fc3f4f-a992-4716-b972-3958bda8b231/10.27.51.186

$ kubectl get nodes -o wide
NAME                                   STATUS     ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP    OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
52bf9358-7a6e-4552-87b1-6591dc88634c   Ready      <none>   6h20m   v1.12.4   10.27.51.185   10.27.51.185   Ubuntu 16.04.5 LTS   4.15.0-43-generic   docker://18.6.1
62fc3f4f-a992-4716-b972-3958bda8b231   NotReady   <none>   31m     v1.12.4   10.27.51.186   10.27.51.186   Ubuntu 16.04.5 LTS   4.15.0-43-generic   docker://18.6.1
a2cfcf41-922d-487b-bdf6-b453d7fb3105   Ready      <none>   7m12s   v1.12.4   10.27.51.187   10.27.51.187   Ubuntu 16.04.5 LTS   4.15.0-43-generic   docker://18.6.1
c1b53999-ac8b-4187-aae6-940bf61b4e2b   Ready      <none>   6h13m   v1.12.4   10.27.51.189   10.27.51.189   Ubuntu 16.04.5 LTS   4.15.0-43-generic   docker://18.6.1

After 40 seconds, the node enters a NotReady state. After a further 5 minutes, the Pod is marked for deletion and enters a Terminating state, as shown is the describe command below.

$ kubectl describe pod cassandra-0
Name:                      cassandra-0
Namespace:                 cassandra
Priority:                  0
PriorityClassName:         <none>
Node:                      62fc3f4f-a992-4716-b972-3958bda8b231/10.27.51.186
Start Time:                Wed, 12 Jun 2019 16:16:25 +0100
Labels:                    app=cassandra
                           controller-revision-hash=cassandra-589c765486
                           statefulset.kubernetes.io/pod-name=cassandra-0
Annotations:               <none>
Status:                    Terminating(lasts <invalid>)
Termination Grace Period:  180s
Reason:                    NodeLost
Message:                   Node 62fc3f4f-a992-4716-b972-3958bda8b231 which was running pod cassandra-0 is unresponsive
IP:                        10.200.82.12
Controlled By:             StatefulSet/cassandra

Now the interesting thing is that the system remains in this state indefinitely. The Pod is left Terminating, it isn’t scheduled on a new node, and thus there is no attempt to detach the PV from the current node, and re-attach it to the new node. This is expected, and is as currently designed, since there is no communication possible between the kubelet and the API server. It doesn’t know if the node has failed, or if it is simply network partitioned. The node could just come back at any time. So it just waits. At this point, if the node is never coming back, you could try to force delete the Pod, or delete the node. We will do the latter, so we will now need to issue a kubectl delete node command against the node that was powered off. Once that command is initiated, immediately things begin to happen. First of all, we see the Pod get deleted, and because it is a StatefulSet it is automatically re-scheduled on a new node. You can also see the State change from Terminating to Pending. It is waiting for the PV to get attached and mounted before it can start.

$ kubectl describe pod cassandra-0
Name:               cassandra-0
Namespace:          cassandra
Priority:           0
PriorityClassName:  <none>
Node:               a2cfcf41-922d-487b-bdf6-b453d7fb3105/10.27.51.187
Start Time:         Wed, 12 Jun 2019 16:51:06 +0100
Labels:             app=cassandra
                    controller-revision-hash=cassandra-589c765486
                    statefulset.kubernetes.io/pod-name=cassandra-0
Annotations:        <none>
Status:             Pending
IP:
Controlled By:      StatefulSet/cassandra

Now, initially we see some errors in the Pod’s events. Once again, the volume cannot be attached to the new node as it is still attached to the original node.

Normal    Scheduled                Pod    Successfully assigned cassandra/cassandra-0 to a2cfcf41-922d-487b-bdf6-b453d7fb3105
Warning   FailedAttachVolume       Pod    Multi-Attach error for volume "pvc-91600a74-8d16-11e9-b070-005056a2a261" \
                                          Volume is already exclusively attached to one node and can't be attached to another
Warning   FailedMount              Pod    Unable to mount volumes for pod "cassandra-0_cassandra(e3b15fb4-8d29-11e9-b070-005056a2a261)":\
                                          timeout expired waiting for volumes to attach or mount for pod "cassandra"/"cassandra-0". \
                                          list of unmounted volumes=[cassandra-data]. \
                                          list of unattached volumes=[cassandra-data default-token-6fj8b]

At this point, the 6 minute force detach timer kicks off in the reconciler of the attacherDetacher controller. Once the 6 minutes has expired, we see some new events happening on the Pod indicating the the attach of the PV succeeded, and at this point the Pod is up and running once more.

Normal    SuccessfulAttachVolume   Pod    AttachVolume.Attach succeeded for volume "pvc-91600a74-8d16-11e9-b070-005056a2a261"
Normal    Pulling                  Pod    pulling image "harbor.rainpole.com/library/cassandra:v11"
Normal    Pulled                   Pod    Successfully pulled image "harbor.rainpole.com/library/cassandra:v11"
Normal    Created                  Pod    Created container
Normal    Started                  Pod    Started container

$ kubectl describe pod cassandra-0
Name:               cassandra-0
Namespace:          cassandra
Priority:           0
PriorityClassName:  <none>
Node:               a2cfcf41-922d-487b-bdf6-b453d7fb3105/10.27.51.187
Start Time:         Wed, 12 Jun 2019 16:51:06 +0100
Labels:             app=cassandra
                    controller-revision-hash=cassandra-589c765486
                    statefulset.kubernetes.io/pod-name=cassandra-0
Annotations:        <none>
Status:             Running
IP:                 10.200.100.10
Controlled By:      StatefulSet/cassandra

Let’s finish off our test by verifying that the contents of the Cassandra DB is still intact.

$ kubectl exec -it cassandra-0 -n cassandra -- cqlsh
Connected to K8Demo at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.9 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh> use mydemo;
cqlsh:mydemo> select * from emp;
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+----------+----------+-----------+---------
    100 |     Cork |   Cormac |       999 | 1000000
(1 rows)
cqlsh:mydemo>

LGTM. So that is some interesting behaviour when a node is powered off, and took quite a bit of testing and re-testing to verify that this is indeed correct behaviour. I guess the first thing to highlight is the fact that I was running with a single Replica. In StatefulSets, you would most likely be running with multiple Replicas, and some load-balancer front end. So even if the event of a node powering off like this, leaving one Pod Terminating indefinitely, the application would still be available.

So to recap, in my observations, when a node is powered off, after ~40 seconds of no heartbeats from the node, it is marked as NotReady. After a further ~5 minutes, the Pod enter Terminating state, and remains there indefinitely. Once the kubectl node delete command is issued, the Pod is deleted and since it is a StatefulSet, is re-scheduled on a new node. 6 minutes later the reconciler for the attacherDetacher controller force detaches the PV from the original node, and now it can then be attached to the new node where the Pod has been scheduled, and mounted by the Pod.

If you are interested in the attacherDetacher events, you can use the command kubectl logs -f kube-controller-manager-xxx -n kube-system on native K8s systems. If you are using PKS, you will need to SSH to the master VM, and then change directory to /var/vcap/sys/log/kube-controller-manager. From here you can do a tail -f kube-controller-manager.stderr.log and see the various attacher-detacher events.

Recovery after shutdown/delete

Let’s now assume that whatever the issue with the node that was shutdown has been resolved, and that we now want to bring the node back into the cluster. Well, remember that if this node was powered off, when it is powered on again, this node will still think it has the original PV/VMDK attached. But at this point, we have forcibly detached it to allow it to be attached to another node so that the Pod can be restarted/rescheduled. So we are going to have to address this. Let’s see what happens when you attempt to power on the original node. Note that you will see the same issue if you try to power on the node from the vSphere client UI, but I wanted to show you a useful PKS command to do a similar task from the command line. I am going to use the cloud-check (cck) BOSH command to request a reboot of the powered off node VM.

$ bosh -d service-instance_3aa87d05-87a3-462a-8f1d-5f07cd5d7bda cck
Using environment '10.27.51.141' as client 'ops_manager'
Using deployment 'service-instance_3aa87d05-87a3-462a-8f1d-5f07cd5d7bda'
Task 1677
Task 1677 | 13:05:57 | Scanning 5 VMs: Checking VM states (00:00:21)
Task 1677 | 13:06:18 | Scanning 5 VMs: 4 OK, 1 unresponsive, 0 missing, 0 unbound (00:00:00)
Task 1677 | 13:06:18 | Scanning 5 persistent disks: Looking for inactive disks (00:00:27)
Task 1677 | 13:06:45 | Scanning 5 persistent disks: 5 OK, 0 missing, 0 inactive, 0 mount-info mismatch (00:00:00)
Task 1677 Started  Thu Jun 13 13:05:57 UTC 2019
Task 1677 Finished Thu Jun 13 13:06:45 UTC 2019
Task 1677 Duration 00:00:48
Task 1677 done

#  Type                Description
2  unresponsive_agent  VM for 'worker/580b6ee0-970e-44cb-896c-9520da134b30 (1)' with cloud ID \
'vm-33849704-0c77-450b-939c-dbea525e6df7' is not responding.
1 problems

1: Skip for now
2: Reboot VM
3: Recreate VM without waiting for processes to start
4: Recreate VM and wait for processes to start
5: Delete VM
6: Delete VM reference (forceful; may need to manually delete VM from the Cloud to avoid IP conflicts)

VM for 'worker/580b6ee0-970e-44cb-896c-9520da134b30 (1)' with cloud ID \
'vm-33849704-0c77-450b-939c-dbea525e6df7' is not responding. (1): 2
Continue? [yN]: y

Task 1678
Task 1678 | 13:07:10 | Applying problem resolutions: VM for 'worker/580b6ee0-970e-44cb-896c-9520da134b30 (1)'\
 with cloud ID 'vm-33849704-0c77-450b-939c-dbea525e6df7' is not responding. (unresponsive_agent 80): Reboot VM (00:04:04)
                    L Error: Unknown CPI error 'Unknown' with message 'File system specific implementation of \
                      OpenFile[file] failed' in 'reboot_vm' CPI method (CPI request ID: 'cpi-735808')
Task 1678 | 13:11:14 | Error: Error resolving problem '2': Unknown CPI error 'Unknown' with message 'File system \
                       specific implementation of OpenFile[file] failed' in 'reboot_vm' CPI method (CPI request ID: \
                       'cpi-735808')
Task 1678 Started  Thu Jun 13 13:07:10 UTC 2019
Task 1678 Finished Thu Jun 13 13:11:14 UTC 2019
Task 1678 Duration 00:04:04
Task 1678 error

Resolving problems for deployment 'service-instance_3aa87d05-87a3-462a-8f1d-5f07cd5d7bda':
  Expected task '1678' to succeed but state is 'error'
Exit code 1

As we can see this request failed, and this is because the node VM is still under the impression that the PV/VMDK is attached, but of course it has been forcibly removed. This is the error as it appears in the vSphere Client UI.

We will need to manually remove the VMDK from the node VM. We saw how to do that earlier through the vSphere Client UI, ensuring that we do not remove the disk from the datastore. With the VMDK removed from the node VM completed, lets try the “cck” once more.

$ bosh -d service-instance_3aa87d05-87a3-462a-8f1d-5f07cd5d7bda cck
Using environment '10.27.51.141' as client 'ops_manager'
Using deployment 'service-instance_3aa87d05-87a3-462a-8f1d-5f07cd5d7bda'
Task 1679
Task 1679 | 13:15:13 | Scanning 5 VMs: Checking VM states (00:00:20)
Task 1679 | 13:15:33 | Scanning 5 VMs: 4 OK, 1 unresponsive, 0 missing, 0 unbound (00:00:00)
Task 1679 | 13:15:33 | Scanning 5 persistent disks: Looking for inactive disks (00:00:27)
Task 1679 | 13:16:00 | Scanning 5 persistent disks: 5 OK, 0 missing, 0 inactive, 0 mount-info mismatch (00:00:00)
Task 1679 Started  Thu Jun 13 13:15:13 UTC 2019
Task 1679 Finished Thu Jun 13 13:16:00 UTC 2019
Task 1679 Duration 00:00:47
Task 1679 done

#  Type                Description
3  unresponsive_agent  VM for 'worker/580b6ee0-970e-44cb-896c-9520da134b30 (1)' with cloud ID \
'vm-33849704-0c77-450b-939c-dbea525e6df7' is not responding.
1 problems

1: Skip for now
2: Reboot VM
3: Recreate VM without waiting for processes to start
4: Recreate VM and wait for processes to start
5: Delete VM
6: Delete VM reference (forceful; may need to manually delete VM from the Cloud to avoid IP conflicts)

VM for 'worker/580b6ee0-970e-44cb-896c-9520da134b30 (1)' with cloud ID \
'vm-33849704-0c77-450b-939c-dbea525e6df7' is not responding. (1): 2

Continue? [yN]: y

Task 1680
Task 1680 | 13:16:10 | Applying problem resolutions: VM for 'worker/580b6ee0-970e-44cb-896c-9520da134b30 (1)' \
with cloud ID 'vm-33849704-0c77-450b-939c-dbea525e6df7' is not responding. (unresponsive_agent 80): Reboot VM (00:01:21)
Task 1680 Started  Thu Jun 13 13:16:10 UTC 2019
Task 1680 Finished Thu Jun 13 13:17:31 UTC 2019
Task 1680 Duration 00:01:21
Task 1680 done

Succeeded

Success! The VM powers back on, and should automatically rejoin the cluster.

$ kubectl get nodes
NAME                                   STATUS   ROLES    AGE   VERSION
52bf9358-7a6e-4552-87b1-6591dc88634c   Ready    <none>   29h   v1.12.4
62fc3f4f-a992-4716-b972-3958bda8b231   Ready    <none>   68m   v1.12.4
a2cfcf41-922d-487b-bdf6-b453d7fb3105   Ready    <none>   22h   v1.12.4
c1b53999-ac8b-4187-aae6-940bf61b4e2b   Ready    <none>   29h   v1.12.4

That completes our overview of failure handling, focusing on node failures/shutdowns. Hope it helps to clarify some of the expected behaviors.

Summary

If you’ve read this far, you’re probably asking yourself why is this not handled automatically? For example, why isn’t there a force detach when the node isn’t accessible for a period of time? That is a great question, and is the same question a lot of people are asking in the Kubernetes community. I want to point out that this is not just behavior that is seen with vSphere, but is seen on lots of other platforms. I know there are a few discussions going on in the community about whether or not this is desired behaviour, and if there is something that could be done to prevent Pods remaining in a Terminating state indefinitely. These are still being debated but at the moment, this is how Kubernetes behaves.

Manifests used in this demo can be found on my vsphere-storage-101 github repo.

2 Replies to “Kubernetes Storage on vSphere 101 – Failure Scenarios”

Glenn Dekhayser says:

June 24, 2019 at 4:03 pm

Question. If one was using, say an NFS volume/mount point as their PV instead of a VMDK which requires “ownership”, would that change the outcomes above?
1. Cormac says:
  
  June 24, 2019 at 4:16 pm
  
  I only tested these scenarios with the VCP (vSphere Cloud Provider) driver Glenn. I’m hoping to do some further testing with the newer CSI driver shortly. I can look at NFS sometime after that. At the moment, I don’t know how it would behave, but I have seen ‘lock’ related issues reported with NFS (e.g. server fails, clients still trying to mount).

Comments are closed.