Read-Write-Many Persistent Volumes with vSAN 7 File Services

A few weeks back, just after the vSphere 7.0 launch event, I wrote an article about Native File Services in vSAN 7.0. I had a few questions asking why we decided on NFS support in this initial release, and not something like SMB or some other protocol. The reason is quite straightforward. We are positioning vSAN as a platform for both traditional virtual machine workloads and newer containerized workloads. We chose NFS to address a storage requirement in Kubernetes, namely a way to share Persistent Volumes between Pods. To date, the vSphere CSI driver only provisioned block-based Persistent Volumes which were Read-Write-Once, meaning that only one Pod could consume the volume at a time. With the new vSphere CSI driver, VMware now supports the dynamic provisioning of file shares on vSAN 7.0 with the File Service feature enabled. These shares can then be consumed by container workloads. In this post, I want to show you how this works.

Disclaimer: “To be clear, this post is based on a pre-GA version of both vSAN 7 File Services and the new vSphere CSI driver. While the assumption is that not much should change between the time of writing and when the products become generally available, I want readers to be aware that feature behaviour and the user interface could still change before then.”

I’m not going to spend any time talking about the deployment and configuration of vSAN 7 File Services as this has already been covered in the earlier post. In that post, we saw how to manually create an NFS file share. In this post, we will see how a file share is dynamically instantiated when a Kubernetes application requests a Read-Write-Many (RWX) Persistent Volume using a StorageClass that refers to vSAN File Services, on a Kubernetes (K8s) cluster deployed on vSphere with the new CSI driver.
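Before we start, it is worth a quick sanity check that the new CSI driver is actually up and running in the Kubernetes cluster. This is just a sketch; I am assuming the driver was deployed into the kube-system namespace as per the standard installation instructions, so adjust the namespace to suit your own deployment.

$ kubectl get csidrivers
$ kubectl get pods -n kube-system | grep vsphere-csi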

ReadWriteOnce (Block) Persistent Volumes Revisited

Let’s start with a quick deployment of an application that uses Block Persistent Volumes, just so we can look at how that behaves on vSphere/vSAN. This will show you some of the extensive features we have in the vSphere UI for managing and monitoring containers, as well as demonstrate the issues with trying to share a block PV between two Pods.

What I am doing here is:

  1. Create a StorageClass which is using a (block) RAID1 policy, implying the Persistent Volume will be instantiated on my vSAN datastore as a block VMDK.
  2. Create a ReadWriteOnce Persistent Volume Claim (PVC), which will dynamically create a Persistent Volume (PV).
  3. Create a Pod that uses that PVC, which in turn means that it gets the PV associated with the PVC.
  4. Launch another Pod with the same PVC which will demonstrate that the Pods cannot share the same RWO block volume.

Here are the YAML manifests that I will use for this demo. The kind field describes what each object is. This first manifest is the StorageClass, which, put simply, selects a vSphere datastore on which to place any provisioned Persistent Volumes. The provisioner field is a reference to the VMware CSI driver. The storagepolicyname field refers to an SPBM policy in vSphere. In this case, that policy will result in selecting my vSAN datastore.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsan-block-sc
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "RAID1"

This is the Persistent Volume Claim manifest. It results in the creation of a 2Gi Persistent Volume (VMDK) on the vSphere storage referenced by the storageClassName. Since this storageClassName refers to the StorageClass above, this PV will be created on my vSAN datastore.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc
spec:
  storageClassName: vsan-block-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi

This is the Pod manifest. It will create a simple “busybox” Pod which mounts the volume referenced by the claimName block-pvc, which is our PVC above. This will result in a 2Gi VMDK attached and mounted on /mnt/volume1 in the Pod.

apiVersion: v1
kind: Pod
metadata:
  name: block-pod-a
spec:
  containers:
  - name: block-pod-a
    image: "k8s.gcr.io/busybox"
    volumeMounts:
    - name: block-vol
      mountPath: "/mnt/volume1"
    command: [ "sleep", "1000000" ]
  volumes:
    - name: block-vol
      persistentVolumeClaim:
        claimName: block-pvc

This is the second Pod which is identical to the first. It will also try to attach and mount the same PV. However, since this is an RWO persistent volume, and it is already attached and mounted to the first Pod, we will see the attach operation of the PV to this Pod fail.

apiVersion: v1
kind: Pod
metadata:
  name: block-pod-b
spec:
  containers:
  - name: block-pod-b
    image: "k8s.gcr.io/busybox"
    volumeMounts:
    - name: block-vol
      mountPath: "/mnt/volume1"
    command: [ "sleep", "1000000" ]
  volumes:
    - name: block-vol
      persistentVolumeClaim:
        claimName: block-pvc

Let’s create each object and monitor the vSphere client for changes/updates, starting with the StorageClass.

$ ls
block-pod-a.yaml block-pod-b.yaml block-pvc.yaml block-sc.yaml

$ kubectl apply -f block-sc.yaml
storageclass.storage.k8s.io/vsan-block-sc created

$ kubectl get sc
NAME          PROVISIONER             AGE
vsan-block-sc csi.vsphere.vmware.com  4s

The next step is to create the PVC and query the resulting PV.

$ kubectl apply -f block-pvc.yaml
persistentvolumeclaim/block-pvc created

$ kubectl get pvc
NAME      STATUS VOLUME                                   CAPACITY  ACCESS MODES STORAGECLASS   AGE
block-pvc Bound  pvc-6398f749-f93e-4dfd-8121-cf91926d642e 2Gi       RWO          vsan-block-sc  10s

$ kubectl get pv
NAME                                      CAPACITY  ACCESS MODES  RECLAIM POLICY  STATUS  CLAIM              STORAGECLASS   REASON  AGE
pvc-6398f749-f93e-4dfd-8121-cf91926d642e  2Gi       RWO           Delete          Bound   default/block-pvc  vsan-block-sc          13s

And this volume is now visible in the Container Volumes view in the vSphere client, thanks to CNS.

We can see that it has been instantiated as a RAID-1 (Mirror) on the vSAN datastore. If we click on the volume name, and then View Placement Details, we can see the actual layout of the PV across vSAN hosts and vSAN disks, allowing vSphere admins to track right down and see exactly which infrastructure components are backing a volume.

Another useful feature for the vSphere admin is the ability to identify “stranded” persistent volumes, in other words, PVs that are not attached to any Pod. We have not yet deployed any Pod, so let’s see how the PV appears in the vSAN > Capacity > Usage breakdown. Here I have expanded User objects. Note that it is identifying “Block container volumes (not attached to a VM)”. By VM, we mean a Kubernetes worker node. This is why it appears in the User objects report and not the VM report. For a Pod to mount a volume, that volume needs to be attached to the Kubernetes node, and the kubelet running inside the node will then take care of making it available to the Pod. Since we do not have any Pod yet to consume this volume, the PV’s capacity appears under the “not attached” section.

I can click on the User objects in the chart to get a more detailed view.

Now the numbers here are pretty small since I have only provisioned a single PV and vSAN always provisions thin disks. But hopefully it gives you a good idea of the level of detail we can get, and how this can really help a vSphere admin who is managing vSphere infrastructure which is hosting Kubernetes clusters.
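As an aside, you can get a rough equivalent of this “attached versus not attached” view from the Kubernetes side. Every CSI volume that is currently attached to a worker node is represented by a VolumeAttachment object, so a PV that does not show up in that list (like our block PV right now) is a candidate for the “not attached” category above. A quick sketch, with the output omitted:

$ kubectl get volumeattachments
$ kubectl get pv    # compare the PV names against the list above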

Let’s now go ahead and provision our first Pod.

$ kubectl apply -f block-pod-a.yaml
pod/block-pod-a created

$ kubectl get pod
NAME        READY STATUS            RESTARTS AGE
block-pod-a 0/1   ContainerCreating 0        4s

$ kubectl get pod
NAME        READY STATUS  RESTARTS AGE
block-pod-a 1/1   Running 0        16s

The Pod is ready. We can use the following command to examine the events associated with the creation of the Pod.

$ kubectl get event --field-selector involvedObject.name=block-pod-a
LAST SEEN TYPE    REASON                  OBJECT           MESSAGE
1m        Normal  Scheduled               pod/block-pod-a  Successfully assigned default/block-pod-a to k8s-worker7-02
1m        Normal  SuccessfulAttachVolume  pod/block-pod-a  AttachVolume.Attach succeeded for volume "pvc-6398f749-f93e-4dfd-8121-cf91926d642e"
1m        Normal  Pulling                 pod/block-pod-a  Pulling image "k8s.gcr.io/busybox"
1m        Normal  Pulled                  pod/block-pod-a  Successfully pulled image "k8s.gcr.io/busybox"
1m        Normal  Created                 pod/block-pod-a  Created container block-pod-a
1m        Normal  Started                 pod/block-pod-a  Started container block-pod-a

It looks like the volume was successfully attached. Let’s log onto it and check to see if it mounted the volume successfully.

$ kubectl exec -it block-pod-a -- /bin/sh

/ # mount | grep /mnt/volume1
/dev/sdb on /mnt/volume1 type ext4 (rw,relatime,data=ordered)

/ # df /mnt/volume1
Filesystem  1K-blocks  Used  Available  Use%  Mounted on
/dev/sdb    1998672    6144  1871288    0%    /mnt/volume1
/ #

This looks good as well. And now that the volume has been attached to a VM (the K8s worker node where the Pod was scheduled), the Usage breakdown changes how it is reported. It now appears in the VM category, under Block container volumes (attached to a VM). So it is no longer stranded. Very useful.

Now let me show you what happens if we try to deploy another Pod that also attempts to attach and mount this ReadWriteOnce block volume:

$ kubectl apply -f block-pod-b.yaml
pod/block-pod-b created

$ kubectl get pod
NAME         READY  STATUS            RESTARTS  AGE
block-pod-a  1/1    Running           0         15m
block-pod-b  0/1    ContainerCreating 0         7s

$ kubectl get event --field-selector involvedObject.name=block-pod-b
LAST SEEN  TYPE     REASON              OBJECT           MESSAGE
16s        Normal   Scheduled           pod/block-pod-b  Successfully assigned default/block-pod-b to k8s-worker7-01
16s        Warning  FailedAttachVolume  pod/block-pod-b  Multi-Attach error for volume "pvc-6398f749-f93e-4dfd-8121-cf91926d642e" Volume is already used by pod(s) block-pod-a

As expected, we failed to attach the volume to Pod B since RWO volumes cannot be simultaneously attached to multiple Pods, and the volume is already in use by Pod A. Since Kubernetes is an eventually consistent system, it continues to try and reconcile this request to create the container. However, this will never succeed, and the Pod will remain in the ContainerCreating state until we tell it to stop.
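By the way, if you do want to stop Kubernetes retrying, simply deleting the second Pod is enough; the PVC and PV are left intact. Something along these lines would do it, although I am going to leave block-pod-b in place for now since we will see it again later in the demo:

$ kubectl delete pod block-pod-b
$ kubectl delete pvc block-pvc    # only if you also want the PV removed (the reclaim policy is Delete)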

Let’s now finish with RWO block volumes and take a closer look at RWX file volumes.

ReadWriteMany (File) Persistent Volumes

Just like we did with block volumes, I am going to do a similar demonstration with file volumes. The steps will be:

  1. Create a StorageClass which is using a (file share) RAID1 policy, implying the Persistent Volume will be instantiated on my vSAN datastore as an NFS file share.
  2. Create a ReadWriteMany Persistent Volume Claim (PVC), which will dynamically create a Persistent Volume (PV).
  3. Create a Pod that uses that PVC, which in turn means that it gets the PV associated with the PVC.
  4. Launch another Pod with the same PVC which will demonstrate that these Pods can share the same RWX volume.

Here are the manifests. They are similar to the block manifests in many ways. However there is a new optional parameter in the StorageClass so that the file system type (fstype) can be specified.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsan-file-sc
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "RAID1"
  csi.storage.k8s.io/fstype: nfs4

[Update] In an earlier version of this post (pre-GA), I showed parameters such as allowrootpermission and ips in the StorageClass manifest. In the GA version of the CSI driver, these parameters were moved to the vsphere.conf, the CSI driver configuration file. Click here to see examples of file share entries in the vsphere.conf. If you need to add new entries to the vsphere.conf, e.g. permissions or IP ranges for new volumes, you do not need to redeploy the CSI driver. Simply modify the vsphere.conf configuration with the new details and update the secret; the new configuration will be loaded after a short wait, determined by the kubelet sync period and cache propagation delay. For file volumes which are already provisioned, permissions cannot be modified.
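To give you a flavour of what those entries look like, here is a short sketch of a file share related section in the vsphere.conf. Treat this as illustrative only; the section name "A" and the IP range are placeholders, and the CSI driver documentation linked above is the definitive reference for the GA syntax.

[NetPermissions "A"]
ips = "10.20.20.0/24"
permissions = "READ_WRITE"
rootsquash = false

Once the file is updated, the secret that holds it (typically vsphere-config-secret in the kube-system namespace) needs to be recreated from the new vsphere.conf so that the driver can pick up the change.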

The PersistentVolumeClaim manifest is almost identical to the previous block example. The major difference is that the accessMode is now set to ReadWriteMany whereas previously it was set to ReadWriteOnce.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: file-pvc
spec:
  storageClassName: vsan-file-sc
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi

The Pod manifests are also very similar to before. The idea here is to demonstrate that both Pods can attach, mount and write to the same ReadWriteMany shared volume simultaneously.

apiVersion: v1
kind: Pod
metadata:
  name: file-pod-a
spec:
  containers:
  - name: file-pod-a
    image: "k8s.gcr.io/busybox"
    volumeMounts:
    - name: file-vol
      mountPath: "/mnt/volume1"
    command: [ "sleep", "1000000" ]
  volumes:
    - name: file-vol
      persistentVolumeClaim:
        claimName: file-pvc

This is the second Pod which will try to share the RWX persistent volume with the first Pod.

apiVersion: v1
kind: Pod
metadata:
  name: file-pod-b
spec:
  containers:
  - name: file-pod-b
    image: "k8s.gcr.io/busybox"
    volumeMounts:
    - name: file-vol
      mountPath: "/mnt/volume1"
    command: [ "sleep", "1000000" ]
  volumes:
    - name: file-vol
      persistentVolumeClaim:
        claimName: file-pvc

Let’s begin the same way as we did for block, first creating the storage class, then the PVC. We will then see what changes have occurred on the vSphere client UI. I won’t remove the block objects created previously so that you can compare them to the objects created for file.

$ kubectl apply -f file-sc.yaml
storageclass.storage.k8s.io/vsan-file-sc created


$ kubectl get sc
NAME            PROVISIONER              AGE
vsan-block-sc   csi.vsphere.vmware.com   76m
vsan-file-sc    csi.vsphere.vmware.com   5s


$ kubectl apply -f file-pvc.yaml
persistentvolumeclaim/file-pvc created


$ kubectl get pvc
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
block-pvc   Bound    pvc-6398f749-f93e-4dfd-8121-cf91926d642e   2Gi        RWO            vsan-block-sc   74m
file-pvc    Bound    pvc-216e6403-fd0c-48ea-bd05-c245a54d72ac   2Gi        RWX            vsan-file-sc    25s


$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM               STORAGECLASS    REASON   AGE
pvc-216e6403-fd0c-48ea-bd05-c245a54d72ac   2Gi        RWX            Delete           Bound    default/file-pvc    vsan-file-sc             17s
pvc-6398f749-f93e-4dfd-8121-cf91926d642e   2Gi        RWO            Delete           Bound    default/block-pvc   vsan-block-sc            74m

Note the access mode on the PVC and PV. The new volume is RWX, meaning ReadWriteMany. Let’s see what changes have occurred in the vSphere client after running the above commands. If we navigate to Configure > vSAN > File Service Shares, we observe that a new dynamically created file share now exists, of type Container Volume.

And if we click on the name of the volume, we once again get taken to the Monitor > vSAN > Virtual Objects view where we can see the object placement details, just like we could for the block volume previously. The container volume also appears in the Container Volumes view.

Let’s now deploy the first of our Pods and see what we get. Notice that the second block Pod is still in the ‘ContainerCreating’ state from earlier, and will remain so indefinitely.

$ kubectl apply -f file-pod-a.yaml
pod/file-pod-a created


$ kubectl get pod
NAME          READY   STATUS              RESTARTS   AGE
block-pod-a   1/1     Running             0          68m
block-pod-b   0/1     ContainerCreating   0          52m
file-pod-a    1/1     Running             0          22s


$  kubectl get event --field-selector involvedObject.name=file-pod-a
LAST SEEN   TYPE     REASON                   OBJECT           MESSAGE
51s         Normal   Scheduled                pod/file-pod-a   Successfully assigned default/file-pod-a to k8s-worker7-01
51s         Normal   SuccessfulAttachVolume   pod/file-pod-a   AttachVolume.Attach succeeded for volume "pvc-216e6403-fd0c-48ea-bd05-c245a54d72ac"
42s         Normal   Pulled                   pod/file-pod-a   Container image "gcr.io/google_containers/busybox:1.24" already present on machine
42s         Normal   Created                  pod/file-pod-a   Created container file-pod-a
42s         Normal   Started                  pod/file-pod-a   Started container file-pod-a


The Pod was created and the volume was attached successfully, as per the events above. OK – let’s do something simple on the volume to show that we can read and write to it.

$ kubectl exec -it file-pod-a -- /bin/sh 

/# mount | grep /mnt/volume1 
10.27.51.31:/52890fc4-b24d-e185-f33c-638eabfa5e25 on /mnt/volume1 type nfs4 \
(rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,\
timeo=600,retrans=2,sec=sys,clientaddr=192.168.232.1,local_lock=none,addr=10.27.51.31) 

/ # df /mnt/volume1 
Filesystem           1K-blocks      Used Available Use% Mounted on 
10.27.51.31:/52890fc4-b24d-e185-f33c-638eabfa5e25                      
                     4554505216         0 4554402816   0% /mnt/volume1

/ # cd /mnt/volume1

/mnt/volume1 # mkdir CreatedByPodA

/mnt/volume1 # cd CreatedByPodA

/mnt/volume1/CreatedByPodA # echo "Pod A was here" >> sharedfile

/mnt/volume1/CreatedByPodA # cat sharedfile
Pod A was here

The next step is to launch our second Pod, and make sure it can share access and mount the same volume.

$ kubectl get pods
NAME         READY  STATUS            RESTARTS  AGE
block-pod-a  1/1    Running           0         116m
block-pod-b  0/1    ContainerCreating 0         100m
file-pod-a   1/1    Running           0         9m11s


$ kubectl apply -f file-pod-b.yaml
pod/file-pod-b created


$ kubectl get pods
NAME         READY  STATUS            RESTARTS  AGE
block-pod-a  1/1    Running           0         116m
block-pod-b  0/1    ContainerCreating 0         100m
file-pod-a   1/1    Running           0         9m29s
file-pod-b   1/1    Running           0         6s


$ kubectl get event --field-selector involvedObject.name=file-pod-b
LAST SEEN  TYPE    REASON                 OBJECT           MESSAGE
27s        Normal  Scheduled               pod/file-pod-b  Successfully assigned default/file-pod-b to k8s-worker7-02
27s        Normal  SuccessfulAttachVolume  pod/file-pod-b  AttachVolume.Attach succeeded for volume "pvc-216e6403-fd0c-48ea-bd05-c245a54d72ac"
19s        Normal  Pulled                  pod/file-pod-b  Container image "gcr.io/google_containers/busybox:1.24" already present on machine
18s        Normal  Created                 pod/file-pod-b  Created container file-pod-b
17s        Normal  Started                 pod/file-pod-b  Started container file-pod-b

The first thing to notice is the set of events. These all look good, and it would appear that the second Pod, Pod B, has been able to successfully mount the RWX Persistent Volume, even though it is already attached and mounted to Pod A. Excellent! The final step is to log into Pod B to check if we can also read and write to the volume.

$ kubectl exec -it file-pod-b -- /bin/sh

/ # mount | grep /mnt/volume1
10.27.51.31:/52890fc4-b24d-e185-f33c-638eabfa5e25 on /mnt/volume1 type nfs4 \
(rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,\
timeo=600,retrans=2,sec=sys,clientaddr=192.168.213.193,local_lock=none,addr=10.27.51.31)

/ # df | grep /mnt/volume1
4554263552 0 4554161152 0% /mnt/volume1

/ # df /mnt/volume1
Filesystem           1K-blocks      Used  Available   Use%  Mounted on
10.27.51.31:/52890fc4-b24d-e185-f33c-638eabfa5e25
                     4554263552     0     4554161152  0%    /mnt/volume1

/ # cd /mnt/volume1

/mnt/volume1 # ls
CreatedByPodA

/mnt/volume1 # cd CreatedByPodA/

/mnt/volume1 # ls
sharedfile

/mnt/volume1/CreatedByPodA # cat sharedfile
Pod A was here

/mnt/volume1/CreatedByPodA # echo "Pod B was here too" >> sharedfile

So from Pod B we can see the directory and file that were created on the share by Pod A, and we have just appended a line to the file originally created by Pod A. If we now flip back to the shell session we have open on Pod A, let’s see if we can see the update to the sharedfile.

/mnt/volume1/CreatedByPodA # cat sharedfile
Pod A was here
Pod B was here too

Looks like both Pods are successfully able to read and write to this RWX persistent volume, dynamically provisioned by vSAN File Services.
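As a final thought, nothing here is limited to standalone Pods. Because the volume is ReadWriteMany, the same PVC can be mounted by multiple replicas of a Deployment, which is a very common reason for wanting RWX in the first place. Here is a minimal sketch reusing the file-pvc claim from above; the Deployment name, image and replica count are simply illustrative.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: file-deploy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: file-deploy
  template:
    metadata:
      labels:
        app: file-deploy
    spec:
      containers:
      - name: file-deploy
        image: "k8s.gcr.io/busybox"
        volumeMounts:
        - name: file-vol
          mountPath: "/mnt/volume1"
        command: [ "sleep", "1000000" ]
      volumes:
      - name: file-vol
        persistentVolumeClaim:
          claimName: file-pvc

Each of the three Pods created by this Deployment would mount the same NFS file share at /mnt/volume1, in exactly the same way as file-pod-a and file-pod-b did above.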

I hope this gives you a good idea of how vSAN File Services can be used for both traditional virtual machine workloads as well as newer containerized workloads. We saw how file shares on vSAN can be dynamically provisioned as persistent volumes, along with a storage class that reflects the desired availability and performance of the volume through storage policies. We also saw some neat UI enhancements, although I haven’t shown them all in this post. The main take-away is that it doesn’t matter whether a developer is using block-based RWO volumes or file-based RWX volumes provisioned from vSAN; the vSphere administrator has full visibility into how the developer is consuming vSphere storage. This allows good communication to develop between a vSphere administrator and the Kubernetes persona, whether that is a developer or a K8s admin. Either way, this enables a DevOps culture in the organization.

If you want to learn more about vSAN File Services and the new CSI driver for file shares, check out this blog post from my good pal Myles. It includes a nice demo of how vSAN File Services works with our new CSI driver and integrates with CNS.

The manifests used for this demo are available on this public GitHub repo.

6 Replies to “Read-Write-Many Persistent Volumes with vSAN 7 File Services”

    1. Something we are considering, but it would be extremely useful if you could provide more context Jean-Philippe.

      When or why would you use RWX block instead of RWX file?
      What is your use case? Which applications, etc?

      Thank you.

      1. Hi Cormac,

        Sorry, I did not explain myself clearly. I was referring to regular vSphere ESXi hosts that are connected to central storage and not using vSAN. Right now the vSphere cloud provider does not permit read-write-many when you create a PV using a storage class based on it.

        Let’s say you deploy an app on your vSphere K8s cluster and you need replicas and persistent storage; in your deployment YAML file, you would set, let’s say, 3 replicas and read-write-many to one PV.
        I know best practice dictates doing a StatefulSet. Would that create a PV directly attached to each node where you have a replica?

        1. Hi Jean-Philippe – the vSphere Cloud Provider (in-tree VCP) only provisions block read-write-once. It does not do read-write-many, neither on block nor on file.

          In the manifest that you describe above you still would not be able to have multiple pods writing to the same PV, even if it was created as part of a deployment.

          You could perhaps use a Pod-based NFS server, which has its PV on vSphere storage, and do RWX that way. I wrote about it here – https://cormachogan.com/2019/06/20/kubernetes-storage-on-vsphere-101-readwritemany-nfs/ – not sure if it meets your needs.

  1. Thanks Cormac,
    Yes I am aware the VCP does only RWO. In your article you said VMware now supports the dynamic provisioning of file shares on vSAN 7.0 with the File Service feature enabled.

    If I summarize my question, it would be: how about vSphere connected to regular central storage? Any plans from VMware to have RWX from vSphere CSI 2.0 eventually, outside of vSAN?

    I got the compatibility chart from one of your posts: https://cormachogan.com/2020/05/07/vsphere-csi-driver-versions-and-capabilities/ where it says (vSAN only).

    Thank you for your time

    1. Ah – got it. Yes, it is something that is being considered. What might be useful is if you could share your use-case with me Jean-Philippe. What is the application that you have that requires block RWX? Many thanks.
