Kubernetes, Hadoop, Persistent Volumes and vSAN

At VMworld 2018, one of the sessions I presented on was running Kubernetes on vSphere, and specifically using vSAN for persistent storage. In that presentation (which you can find here), I used Hadoop as a specific example, primarily because there are a number of moving parts to Hadoop. For example, there are the concepts of a Namenode and a Datanode. Put simply, a Namenode provides the lookup for blocks, whereas Datanodes store the actual blocks of data. Namenodes can be configured as an HA pair with a standby Namenode, but this requires a lot more configuration and resources, and introduces additional components such as journals and ZooKeeper to provide high availability. There is also the option of a secondary Namenode, but this does not provide high availability. On the other hand, Datanodes have their own built-in replication. The presentation showed how we could use vSAN to provide additional resilience to a Namenode, but use less capacity and resources for a component like the Datanode, which has its own built-in protection.

A number of people have asked me how they could go about setting up such a configuration for themselves. They were especially interested in how to consume different policies for the different parts of the application.

In this article I will take a very simple Namenode and Datanode configuration that uses persistent volumes with different policies on our vSAN datastore.

We will also show how, through the use of a Statefulset, we can very easily scale the number of Datanodes (and their persistent volumes) from 1 to 3.

Helm Charts

To begin with, I tried to use the stable Helm chart to deploy Hadoop. You can find it here. While this was somewhat successful, I was unable to figure out a way to scale the deployment’s persistent storage. Each attempt to scale the statefulset successfully created new Pods, but all of the Pods tried to share the same persistent volume. And although K8s provides a config option to change a persistent volume’s access mode from ReadWriteOnce (single Pod access) to ReadWriteMany (multiple Pod access), allowing multiple Pods to share the same PV, this is not supported on vSAN at the time of writing.

However, if you simply want to deploy a single Namenode and a single Datanode with persistent volumes, this stable Hadoop Helm chart will work just fine for you. It is also quite possible that scaling is achievable with the Helm chart; I’m afraid my limited knowledge of Helm meant that, in order to create unique PVs for each Pod as I scaled my Datanodes, I had to look for an alternate method. This led me to a Hadoop Cluster on Kubernetes using flokkr docker images that was already available on GitHub.
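
For what it is worth, deploying the stable chart is a single command – a minimal sketch, assuming the chart is still published as stable/hadoop and using Helm v2 syntax (you may also need to set chart values to enable persistence for the Namenode and Datanode; check the chart’s README for the exact value names):

# Deploy the stable Hadoop chart under the release name "hadoop" (Helm v2 syntax)
helm install --name hadoop stable/hadoop

# Verify what was created
kubectl get pods,pvc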

Hadoop Cluster on Kubernetes

Let’s talk about the flokkr Hadoop cluster. In this solution, there are only two YAML files; the first is config.yaml, which passes a set of configuration files (core-site.xml, yarn-site.xml, etc.) to our Hadoop deployment via a configMap (more on this shortly). The second holds details about the services and Statefulsets for the Namenode and Datanode. I will modify the Statefulsets so that the /data directory on the nodes is placed on persistent volumes rather than on a local filesystem within the container (which is not persistent).

Use of configMap

I’ll be honest – this was the first time I had used the configMap construct, but it looks pretty powerful. In the config.yaml, there are entries for 5 configuration files required by Hadoop when the environment is bootstrapped – core-site.xml, hdfs-site.xml, the log4j.properties file, mapred-site.xml and finally yarn-site.xml. These 5 different pieces of data can be seen when we query the configMap.

root@srvr:~/hdfs# kubectl get configmaps
NAME         DATA   AGE
hadoopconf   5      88m
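
For reference, the configMap in config.yaml has one data entry per configuration file, roughly along these lines (a heavily abbreviated sketch – the actual entries in the flokkr config.yaml are much longer, so only the core-site.xml content is reproduced here):

apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoopconf
data:
  core-site.xml: |
    <configuration>
    <property><name>fs.defaultFS</name><value>hdfs://namenode-0:9000</value></property>
    </configuration>
  hdfs-site.xml: |
    ...
  log4j.properties: |
    ...
  mapred-site.xml: |
    ...
  yarn-site.xml: |
    ...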

When we look at the hdfs.yaml, we will see how this configMap is referenced, and how these configuration files are made available in a specific directory in the application’s containers when the application is launched. First, let’s look at the StatefulSet.spec.template.spec.containers.volumeMounts entry:

          volumeMounts:
            - name: config
              mountPath: "/opt/hadoop/etc/hadoop"

This is where the files referenced in the config.yaml entries are going to be placed when the container/application is launched. If we look at that volume in more detail in StatefulSet.spec.template.spec.volumes, we see the following:

      volumes:
        - name: config
          configMap:
            name: hadoopconf

So the configMap in config.yaml, which is named hadoopconf, will place these 5 configuration files in “/opt/hadoop/etc/hadoop” when the application launches. The application contains an init/bootstrap script which will deploy Hadoop using the configuration in these files. A little bit complicated, but sort of neat. You do not need to make any changes here. Instead, we want to change the /data folder to use Persistent Volumes. Let’s see how to do that next.

Persistent Volumes – changes to hdfs.yaml

Now there are a few changes needed to the hdfs.yaml file to get it to use persistent volumes. We will need to modify the StatefulSets for both the Datanode and the Namenode. First, we will need to add a new mount point for the PV, which for Hadoop will of course be /data. This will appear in StatefulSet.spec.template.spec.containers.volumeMounts and look like the following:

          volumeMounts:
            - name: data
              mountPath: "/data"
              readOnly: false

Next we need to describe the volume that we are going to use. For the PV, we are not using a StatefulSet.spec.template.spec.volumes entry as we did for the configMap. Instead we will use StatefulSet.spec.volumeClaimTemplates. This is what that looks like. First we have the Datanode entry, and then the Namenode entry. The differences are the storage class entries and of course the volume sizes. This is how we will use different storage policies on vSAN for the Datanode and Namenode.

volumeClaimTemplates:
  - metadata:
      name: data
      annotations:
        volume.beta.kubernetes.io/storage-class: hdfs-sc-dn
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 200Gi

volumeClaimTemplates:
  - metadata:
      name: data
      annotations:
        volume.beta.kubernetes.io/storage-class: hdfs-sc-nn
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 50Gi

Storage Classes

We saw in the hdfs.yaml that the Datanode and Namenode referenced two different storage classes. Let’s look at those next. First is the Namenode.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: hdfs-sc-nn
provisioner: kubernetes.io/vsphere-volume
parameters:
    diskformat: thin
    storagePolicyName: gold
    datastore: vsanDatastore

As you can see, the Namenode storageClass uses a gold policy on the vSAN datastore. If we compare it to the Datanode storageClass below, you will see that the policy is the only difference (other than the name). The provisioner is the VMware vSphere Cloud Provider (VCP), which is currently included in K8s distributions but will soon be decoupled, along with the other built-in drivers, as part of the CSI initiative.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: hdfs-sc-dn
provisioner: kubernetes.io/vsphere-volume
parameters:
    diskformat: thin
    storagePolicyName: silver
    datastore: vsanDatastore

vSAN Policies – storagePolicyName

Now you may be wondering why we are using two different policies. This is the beauty of vSAN. As we shall see shortly, the Datanodes have their own built-in replication mechanism (3 copies of each block are stored). Thus, it is possible to deploy the Datanode volumes on vSAN without any underlying protection from vSAN (e.g. RAID-0) simply by specifying an appropriate policy (silver, in my case). This is because if a Datanode fails, there are still two copies of the data blocks on the other Datanodes.

However, if we take the Namenode, it has no such built-in replication or protection feature. Therefore we can protect the underlying persistent volume using an appropriate vSAN policy (e.g. RAID-1, RAID-5). In my example, the gold policy provides this extra protection for the Namenode volumes.

Deployment of Hadoop

Now there are only a few steps to deploying the Hadoop application: (1) create the storage classes, (2) create the configMap in config.yaml and (3) create the services and Statefulsets in hdfs.yaml. All of these can be done with kubectl create -f “file.yaml” commands, as shown below.
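
For example, the full deployment sequence looks something like this (the storage class filenames below are my own – use whatever you called yours):

# (1) Create the two storage classes
kubectl create -f hdfs-sc-nn.yaml
kubectl create -f hdfs-sc-dn.yaml

# (2) Create the configMap holding the Hadoop configuration files
kubectl create -f config.yaml

# (3) Create the services and Statefulsets for the Namenode and Datanode
kubectl create -f hdfs.yaml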

Post Deployment Checks

The following are a bunch of commands that can be used to validate the state of the constituent components of the application after deployment.

root@srvr:~/cormac-hadoop# kubectl get nodes
NAME                STATUS                     ROLES    AGE   VERSION
kubernetes-master   Ready,SchedulingDisabled   <none>   46h   v1.12.0
kubernetes-node1    Ready                      <none>   46h   v1.12.0
kubernetes-node2    Ready                      <none>   46h   v1.12.0
kubernetes-node3    Ready                      <none>   46h   v1.12.0

root@srvr:~/cormac-hadoop# kubectl get configmaps 
NAME         DATA   AGE 
hadoopconf   5      150m

root@srvr:~/cormac-hadoop# kubectl get sc
NAME         PROVISIONER                    AGE
hdfs-sc-dn   kubernetes.io/vsphere-volume   40m
hdfs-sc-nn   kubernetes.io/vsphere-volume   40m

root@srvr:~/cormac-hadoop# kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
datanode     ClusterIP   None         <none>        80/TCP      15m
kubernetes   ClusterIP   10.0.0.1     <none>        443/TCP     46h
namenode     ClusterIP   None         <none>        50070/TCP   15m

root@srvr:~/cormac-hadoop# kubectl get statefulsets
NAME       DESIRED   CURRENT   AGE
datanode   1         1         16m
namenode   1         1         16m

root@srvr:~/cormac-hadoop# kubectl get pvc
NAME                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
dn-data-datanode-0   Bound    pvc-5c5f5c5a-f88e-11e8-9e4a-005056970672   200Gi      RWO            hdfs-sc-dn     10m
nn-data-namenode-0   Bound    pvc-5c68ed9b-f88e-11e8-9e4a-005056970672   50Gi       RWO            hdfs-sc-nn     10m

root@srvr:~/cormac-hadoop# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS   REASON   AGE
pvc-5c5f5c5a-f88e-11e8-9e4a-005056970672   200Gi      RWO            Delete           Bound    default/dn-data-datanode-0   hdfs-sc-dn              10m
pvc-5c68ed9b-f88e-11e8-9e4a-005056970672   50Gi       RWO            Delete           Bound    default/nn-data-namenode-0   hdfs-sc-nn              10m

root@srvr:~/cormac-hadoop# kubectl get statefulsets
NAME       DESIRED   CURRENT   AGE
datanode   1         1         21m
namenode   1         1         21m

Post deploy Hadoop check

Since this is Hadoop, we can very quickly use some Hadoop utilities to check the state of our Hadoop cluster running on Kubernetes. Here is the output of one such command, which generates a report on the HDFS filesystem and also reports on the Datanodes. Note the capacity at the beginning, as we will return to this after scaling out.

root@srvr:~/cormac-hadoop# kubectl exec -n default -it namenode-0 \
-- /opt/hadoop/bin/hdfs dfsadmin -report
Configured Capacity: 210304475136 (195.86 GB)
Present Capacity: 210224738304 (195.79 GB)
DFS Remaining: 210224709632 (195.79 GB)
DFS Used: 28672 (28 KB)
DFS Used%: 0.00%
Replicated Blocks:
     Under replicated blocks: 0
     Blocks with corrupt replicas: 0
     Missing blocks: 0
     Missing blocks (with replication factor 1): 0
     Pending deletion blocks: 0
Erasure Coded Block Groups: 
     Low redundancy block groups: 0
     Block groups with corrupt internal blocks: 0
     Missing block groups: 0
     Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (1):

Name: 172.16.96.5:9866 (172.16.96.5)
Hostname: datanode-0.datanode.default.svc.cluster.local
Decommission Status : Normal
Configured Capacity: 210304475136 (195.86 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 62959616 (60.04 MB)
DFS Remaining: 210224709632 (195.79 GB)
DFS Used%: 0.00%
DFS Remaining%: 99.96%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Dec 05 13:17:29 GMT 2018
Last Block Report: Wed Dec 05 13:05:56 GMT 2018

root@srvr:~/cormac-hadoop#

Post deploy check on config files

We mentioned that the purpose of the configMap in config.yaml was to put in place a set of configuration files that can be used to bootstrap Hadoop. Here is how to verify that this step has indeed occurred (should you need to troubleshoot at any point). First we open a bash shell on the Namenode, and then we navigate to the mount point highlighted in hdfs.yaml to verify that the files exist, which indeed they do in this case.

root@srvr:~/cormac-hadoop# kubectl exec -n default -it namenode-0 \
-- /bin/bash

bash-4.3$ cd /opt/hadoop/etc/hadoop                                                         
bash-4.3$ ls
core-site.xml     hdfs-site.xml     log4j.properties  mapred-site.xml   yarn-site.xml
bash-4.3$ cat core-site.xml
<configuration>
<property><name>fs.defaultFS</name><value>hdfs://namenode-0:9000</value></property>
</configuration>bash-4.3$

Simply type exit to return from the container shell.
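
Alternatively, you can dump the same configuration data straight from the configMap without opening a shell in the container:

# Show the configMap entries and their contents
kubectl describe configmap hadoopconf

# Or dump the full configMap as YAML
kubectl get configmap hadoopconf -o yaml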

Scale out the Datanode statefulSet

We are going to start with the current configuration of 1 Datanode statefulset and 1 Namenode statefulset and we will scale the Datanode statefulset to 3. This should create additional Pods as well as additional persistent volumes and persistent volume claims. Let’s see.

root@srvr:~/cormac-hadoop# kubectl get statefulsets
NAME       DESIRED   CURRENT   AGE
datanode   1         1         21m
namenode   1         1         21m

root@srvr:~/cormac-hadoop# kubectl scale --replicas=3 statefulsets/datanode
statefulset.apps/datanode scaled

root@srvr:~/cormac-hadoop# kubectl get pvc
NAME                 STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
dn-data-datanode-0   Bound     pvc-5c5f5c5a-f88e-11e8-9e4a-005056970672   200Gi      RWO            hdfs-sc-dn     16m
dn-data-datanode-1   Pending                                                                        hdfs-sc-dn     7s
nn-data-namenode-0   Bound     pvc-5c68ed9b-f88e-11e8-9e4a-005056970672   50Gi       RWO            hdfs-sc-nn     16m

root@srvr:~/cormac-hadoop# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS   REASON   AGE
pvc-5c5f5c5a-f88e-11e8-9e4a-005056970672   200Gi      RWO            Delete           Bound    default/dn-data-datanode-0   hdfs-sc-dn              16m
pvc-5c68ed9b-f88e-11e8-9e4a-005056970672   50Gi       RWO            Delete           Bound    default/nn-data-namenode-0   hdfs-sc-nn              16m
pvc-9cd769f7-f890-11e8-9e4a-005056970672   200Gi      RWO            Delete           Bound    default/dn-data-datanode-1   hdfs-sc-dn              6s

root@srvr:~/cormac-hadoop# kubectl get pods
NAME         READY   STATUS              RESTARTS   AGE
datanode-0   1/1     Running             0          16m
datanode-1   0/1     ContainerCreating   0          24s
namenode-0   1/1     Running             0          16m

root@srvr:~/cormac-hadoop# kubectl get statefulsets
NAME       DESIRED   CURRENT   AGE
datanode   3         2         22m
namenode   1         1         22m

So changes are occurring, but they take a little time. Let’s run those commands again now that some time has passed.

root@srvr:~/cormac-hadoop# kubectl get statefulsets
NAME       DESIRED   CURRENT   AGE
datanode   3         3         23m
namenode   1         1         23m

root@srvr:~/cormac-hadoop# kubectl get pvc
NAME                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
dn-data-datanode-0   Bound    pvc-5c5f5c5a-f88e-11e8-9e4a-005056970672   200Gi      RWO            hdfs-sc-dn     18m
dn-data-datanode-1   Bound    pvc-9cd769f7-f890-11e8-9e4a-005056970672   200Gi      RWO            hdfs-sc-dn     2m12s
dn-data-datanode-2   Bound    pvc-b7ce55f9-f890-11e8-9e4a-005056970672   200Gi      RWO            hdfs-sc-dn     87s
nn-data-namenode-0   Bound    pvc-5c68ed9b-f88e-11e8-9e4a-005056970672   50Gi       RWO            hdfs-sc-nn     18m

root@srvr:~/cormac-hadoop# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS   REASON   AGE
pvc-5c5f5c5a-f88e-11e8-9e4a-005056970672   200Gi      RWO            Delete           Bound    default/dn-data-datanode-0   hdfs-sc-dn              18m
pvc-5c68ed9b-f88e-11e8-9e4a-005056970672   50Gi       RWO            Delete           Bound    default/nn-data-namenode-0   hdfs-sc-nn              18m
pvc-9cd769f7-f890-11e8-9e4a-005056970672   200Gi      RWO            Delete           Bound    default/dn-data-datanode-1   hdfs-sc-dn              2m3s
pvc-b7ce55f9-f890-11e8-9e4a-005056970672   200Gi      RWO            Delete           Bound    default/dn-data-datanode-2   hdfs-sc-dn              86s

root@srvr:~/cormac-hadoop# kubectl get pods
NAME         READY   STATUS    RESTARTS   AGE
datanode-0   1/1     Running   0          18m
datanode-1   1/1     Running   0          2m19s
datanode-2   1/1     Running   0          94s
namenode-0   1/1     Running   0          18m

Now we can see how the Datanode statefulset has scaled out with additional Pods and storage.

Post scale-out application check

And now for the final step – let’s check to see if HDFS has indeed scaled out with those new Pods and PVs. We will run the same command as before and get an updated report from the application. Note the updated capacity figure and the additional Datanodes.

root@srvr:~/cormac-hadoop# kubectl exec -n default -it namenode-0 \
-- /opt/hadoop/bin/hdfs dfsadmin -report
Configured Capacity: 630913425408 (587.58 GB)
Present Capacity: 630674206720 (587.36 GB)
DFS Remaining: 630674128896 (587.36 GB)
DFS Used: 77824 (76 KB)
DFS Used%: 0.00%
Replicated Blocks:
     Under replicated blocks: 0
     Blocks with corrupt replicas: 0
     Missing blocks: 0
     Missing blocks (with replication factor 1): 0
     Pending deletion blocks: 0
Erasure Coded Block Groups:
     Low redundancy block groups: 0
     Block groups with corrupt internal blocks: 0
     Missing block groups: 0
     Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 172.16.86.5:9866 (172.16.86.5)
Hostname: datanode-2.datanode.default.svc.cluster.local
Decommission Status : Normal
Configured Capacity: 210304475136 (195.86 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 62963712 (60.05 MB)
DFS Remaining: 210224709632 (195.79 GB)
DFS Used%: 0.00%
DFS Remaining%: 99.96%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Dec 05 13:24:37 GMT 2018
Last Block Report: Wed Dec 05 13:22:34 GMT 2018

Name: 172.16.88.2:9866 (172.16.88.2)
Hostname: datanode-1.datanode.default.svc.cluster.local
Decommission Status : Normal
Configured Capacity: 210304475136 (195.86 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 62963712 (60.05 MB)
DFS Remaining: 210224709632 (195.79 GB)
DFS Used%: 0.00%
DFS Remaining%: 99.96%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Dec 05 13:24:39 GMT 2018
Last Block Report: Wed Dec 05 13:21:57 GMT 2018

Name: 172.16.96.5:9866 (172.16.96.5)
Hostname: datanode-0.datanode.default.svc.cluster.local
Decommission Status : Normal
Configured Capacity: 210304475136 (195.86 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 62959616 (60.04 MB)
DFS Remaining: 210224709632 (195.79 GB)
DFS Used%: 0.00%
DFS Remaining%: 99.96%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Dec 05 13:24:38 GMT 2018
Last Block Report: Wed Dec 05 13:05:56 GMT 2018

Checking the Replication Factor of HDFS

The last thing we want to check is that the Datanodes are indeed replicating. There are a few ways to do this. The following commands create a simple file and then validate its replication factor. In both cases, the commands return 3, which is the default replication factor for HDFS.

root@srvr:~/cormac-hadoop# kubectl exec -n default -it namenode-0 \
-- /opt/hadoop/bin/hdfs dfs -touchz /out.txt

root@srvr:~/cormac-hadoop# kubectl exec -n default -it namenode-0 \
-- /opt/hadoop/bin/hdfs dfs -stat %r /out.txt
3

root@srvr:~/cormac-hadoop# kubectl exec -n default -it namenode-0 \
-- /opt/hadoop/bin/hdfs dfs -ls /out.txt
-rw-r--r--   3 hdfs admin          0 2018-12-05 15:55 /out.txt

Conclusion

That looks to have scaled out just fine. Now there are a few things to keep in mind when dealing with StatefulSets, as per the guidance found here. Deleting a StatefulSet, or scaling it down, will not delete the volumes associated with it. This is done to ensure data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.
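
In other words, if you scale the Datanode statefulset back down, the extra PVCs and PVs remain behind, and it is up to you to remove them once you are sure the data is no longer needed. A quick sketch of what that cleanup might look like, using the claim names from the output above:

# Scale the Datanodes back down to a single replica; the datanode-1 and datanode-2
# Pods are removed, but their claims and volumes are left in place
kubectl scale --replicas=1 statefulsets/datanode

# Only when the data is definitely no longer needed, remove the claims manually
# (the backing volumes are then removed as per the Delete reclaim policy)
kubectl delete pvc dn-data-datanode-1 dn-data-datanode-2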

With that in mind, I hope this has added some clarity to how we can provide different vSAN policies to different parts of a cloud native application, providing additional protection when the application needs it, but not consuming any additional HCI (hyperconverged infrastructure) resources when the application is able to protect itself through built-in mechanisms.