Read-Write-Many Persistent Volumes with vSAN 7 File Services
A few weeks back, just after the vSphere 7.0 launch event, I wrote an article about Native File Services in vSAN 7.0. I had a few questions asking why we decided on NFS support in this initial release, and not something like SMB or some other protocol. The reason is quite straightforward. We are positioning vSAN as a platform for both traditional virtual machine workloads and newer containerized workloads. We chose NFS to address a storage requirement in Kubernetes, namely a way to share Persistent Volumes between Pods. To date, the vSphere CSI driver only provisioned block-based Persistent Volumes which were Read-Write-Once, meaning that a volume could be attached to only one Kubernetes node at a time, so Pods running on different nodes could not share it. With the new vSphere CSI driver, VMware now supports the dynamic provisioning of file shares on vSAN 7.0 with the File Service feature enabled. These shares can then be consumed by container workloads. In this post, I want to show you how this works.
Disclaimer: “To be clear, this post is based on a pre-GA version of both vSAN 7 File Services and the new vSphere CSI driver. While the assumption is that not much should change between the time of writing and when the products become generally available, I want readers to be aware that feature behaviour and the user interface could still change before then.”
I’m not going to spend any time talking about the deployment and configuration of vSAN 7 File Services as this has already been covered in the earlier post. In that post, we saw how to manually create an NFS file share. In this post, we will see how a file share is dynamically instantiated when a Kubernetes application requests a Read-Write-Many (RWX) Persistent Volume using a StorageClass that refers to vSAN File Services, and the Kubernetes (K8s) cluster is deployed on vSphere with the new CSI driver.
ReadWriteOnce (Block) Persistent Volumes Revisited
Let’s start with a quick deployment of an application that uses Block Persistent Volumes, just so we can look at how that behaves on vSphere/vSAN. This will show you some of the extensive features we have in the vSphere UI for managing and monitoring containers, as well as demonstrate the issues with trying to share a block PV between two Pods.
What I am doing here is:
- Create a StorageClass which is using a (block) RAID1 policy, implying the Persistent Volume will be instantiated on my vSAN datastore as a block VMDK.
- Create a ReadWriteOnce Persistent Volume Claim (PVC), which dynamically provisions a Persistent Volume (PV).
- Create a Pod that uses that PVC, which in turn means that it gets the PV associated with the PVC.
- Launch another Pod with the same PVC which will demonstrate that the Pods cannot share the same RWO block volume.
Here are the YAML manifests that I will use for this demo. The kind field describes what each object is. This first manifest is the StorageClass which, put simply, selects a vSphere datastore in which to place any provisioned Persistent Volumes. The provisioner field is a reference to the VMware CSI driver. The storagepolicyname field refers to an SPBM policy in vSphere. In this case, that policy will result in selecting my vSAN datastore.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsan-block-sc
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "RAID1"
This is the Persistent Volume Claim manifest. It results in the creation of a 2Gi Persistent Volume (VMDK) on the vSphere storage referenced by the storageClassName. Since this storageClassName refers to the StorageClass above, this PV will be created on my vSAN datastore.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc
spec:
  storageClassName: vsan-block-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
This is the Pod manifest. It will create a simple “busybox” Pod which mounts the volume referenced by the claimName block-pvc, which is our PVC above. This will result in a 2Gi VMDK being attached and mounted on /mnt/volume1 in the Pod.
apiVersion: v1
kind: Pod
metadata:
  name: block-pod-a
spec:
  containers:
  - name: block-pod-a
    image: "k8s.gcr.io/busybox"
    volumeMounts:
    - name: block-vol
      mountPath: "/mnt/volume1"
    command: [ "sleep", "1000000" ]
  volumes:
    - name: block-vol
      persistentVolumeClaim:
        claimName: block-pvc
This is the second Pod which is identical to the first. It will also try to attach and mount the same PV. However since this is a RWO persistent volume, and it is already attached and mounted to the first Pod, we will see the attach operation of the PV to this Pod fail.
apiVersion: v1
kind: Pod
metadata:
  name: block-pod-b
spec:
  containers:
  - name: block-pod-b
    image: "k8s.gcr.io/busybox"
    volumeMounts:
    - name: block-vol
      mountPath: "/mnt/volume1"
    command: [ "sleep", "1000000" ]
  volumes:
    - name: block-vol
      persistentVolumeClaim:
        claimName: block-pvc
Let’s create each object, and monitor the vSphere client for changes/updates. Let’s create the StorageClass first.
$ ls
block-pod-a.yaml  block-pod-b.yaml  block-pvc.yaml  block-sc.yaml
$ kubectl apply -f block-sc.yaml
storageclass.storage.k8s.io/vsan-block-sc created
$ kubectl get sc
NAME            PROVISIONER              AGE
vsan-block-sc   csi.vsphere.vmware.com   4s
Next step is to create the PVC and query the resulting PV.
$ kubectl apply -f block-pvc.yaml
persistentvolumeclaim/block-pvc created
$ kubectl get pvc
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
block-pvc   Bound    pvc-6398f749-f93e-4dfd-8121-cf91926d642e   2Gi        RWO            vsan-block-sc   10s
$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM               STORAGECLASS    REASON   AGE
pvc-6398f749-f93e-4dfd-8121-cf91926d642e   2Gi        RWO            Delete           Bound    default/block-pvc   vsan-block-sc            13s
And this volume is now visible in the Container Volumes view in the vSphere client, thanks to CNS.
We can see that it has been instantiated as a RAID-1 (Mirror) on the vSAN datastore. If we click on the volume name, and then View Placement Details, we can see the actual layout of the PV across vSAN hosts and vSAN disks, allowing vSphere admins to track right down and see exactly which infrastructure components are backing a volume.
Another useful feature for the vSphere admin is the ability to identify “stranded” persistent volumes, in other words PVs that are not attached to any Pod. We have not yet deployed any Pod, so let’s see how the PV appears in the vSAN > Capacity > Usage breakdown. Here I have expanded User objects. Note that it is identifying “Block container volumes (not attached to a VM)”. By VM, we mean Kubernetes worker node. This is why it appears in the User object report and not the VM report. For a Pod to mount a volume, that volume needs to be attached to the Kubernetes node, and the kubelet running inside the node will then take care of making it available to the Pod. Since we do not have any Pod yet to consume this volume, the PV’s capacity appears under the “not attached” section.
I can click on the User objects in the chart to get a more detailed view.
Now the numbers here are pretty small since I have only provisioned a single PV and vSAN always provisions thin disks. But hopefully it gives you a good idea of the level of detail we can get, and how this can really help a vSphere admin who is managing vSphere infrastructure which is hosting Kubernetes clusters.
Let’s now go ahead and provision our first Pod.
$ kubectl apply -f block-pod-a.yaml
pod/block-pod-a created
$ kubectl get pod
NAME          READY   STATUS              RESTARTS   AGE
block-pod-a   0/1     ContainerCreating   0          4s
$ kubectl get pod
NAME          READY   STATUS    RESTARTS   AGE
block-pod-a   1/1     Running   0          16s
The Pod is ready. We can use the following command to examine the events associated with the creation of the Pod.
$ kubectl get event --field-selector involvedObject.name=block-pod-a
LAST SEEN   TYPE     REASON                   OBJECT            MESSAGE
1m          Normal   Scheduled                pod/block-pod-a   Successfully assigned default/block-pod-a to k8s-worker7-02
1m          Normal   SuccessfulAttachVolume   pod/block-pod-a   AttachVolume.Attach succeeded for volume "pvc-6398f749-f93e-4dfd-8121-cf91926d642e"
1m          Normal   Pulling                  pod/block-pod-a   Pulling image "k8s.gcr.io/busybox"
1m          Normal   Pulled                   pod/block-pod-a   Successfully pulled image "k8s.gcr.io/busybox"
1m          Normal   Created                  pod/block-pod-a   Created container block-pod-a
1m          Normal   Started                  pod/block-pod-a   Started container block-pod-a
It looks like the volume was successfully attached. Let’s log onto it and check to see if it mounted the volume successfully.
$ kubectl exec -it block-pod-a -- /bin/sh
/ # mount | grep /mnt/volume1
/dev/sdb on /mnt/volume1 type ext4 (rw,relatime,data=ordered)
/ # df /mnt/volume1
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb               1998672      6144   1871288   0% /mnt/volume1
/ #
This looks good as well. And now that the volume has been attached to a VM (the K8s worker node where the Pod was scheduled), the Usage breakdown changes how it is reported. It now appears in the VM category, under Block container volumes (attached to a VM). So it is no longer stranded. Very useful.
Now let’s show you what happens if we try to deploy another Pod that also attempts to attach and mount this ReadWriteOnce block volume:
$ kubectl apply -f block-pod-b.yaml
pod/block-pod-b created
$ kubectl get pod
NAME          READY   STATUS              RESTARTS   AGE
block-pod-a   1/1     Running             0          15m
block-pod-b   0/1     ContainerCreating   0          7s
$ kubectl get event --field-selector involvedObject.name=block-pod-b
LAST SEEN   TYPE      REASON               OBJECT            MESSAGE
16s         Normal    Scheduled            pod/block-pod-b   Successfully assigned default/block-pod-b to k8s-worker7-01
16s         Warning   FailedAttachVolume   pod/block-pod-b   Multi-Attach error for volume "pvc-6398f749-f93e-4dfd-8121-cf91926d642e" Volume is already used by pod(s) block-pod-a
As expected, we failed to attach the volume to Pod B. RWO volumes can only be attached to a single node at a time, and the volume is already in use by Pod A on k8s-worker7-02, whereas Pod B was scheduled on k8s-worker7-01. Since Kubernetes is an eventually consistent system, it continues to try and reconcile this request to create the container. However, this will never succeed while Pod A holds the volume, and Pod B will remain in this ContainerCreating state until we tell it to stop.
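Because ReadWriteOnce is enforced per node rather than per Pod, a Pod that lands on the same worker node as Pod A should in principle be able to mount the volume too. The following is a minimal sketch of that idea, not something that was run as part of this demo. The Pod name block-pod-c is hypothetical, and the nodeName value is simply the node that the events above report for block-pod-a.

```yaml
# Hypothetical sketch: block-pod-c is not part of the original demo.
apiVersion: v1
kind: Pod
metadata:
  name: block-pod-c
spec:
  # Assumption: the same worker node that block-pod-a was scheduled on,
  # so the RWO volume does not need to be attached to a second node.
  nodeName: k8s-worker7-02
  containers:
  - name: block-pod-c
    image: "k8s.gcr.io/busybox"
    volumeMounts:
    - name: block-vol
      mountPath: "/mnt/volume1"
    command: [ "sleep", "1000000" ]
  volumes:
  - name: block-vol
    persistentVolumeClaim:
      claimName: block-pvc
```

Pinning a Pod with nodeName bypasses the scheduler, which is fine for an experiment but not something to rely on for real workloads. The point is only that RWO does not give you sharing across nodes, which is the gap that RWX file volumes fill.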
Let’s now finish with RWO block volumes and take a closer look at RWX file volumes.
ReadWriteMany (File) Persistent Volumes
Just like we did with block volumes, I am going to do a similar demonstration with file volumes. The steps will be:
- Create a StorageClass which is using a (file share) RAID1 policy, implying the Persistent Volume will be instantiated on my vSAN datastore as an NFS file share.
- Create a ReadWriteMany Persistent Volume Claim (PVC), which dynamically provisions a Persistent Volume (PV).
- Create a Pod that uses that PVC, which in turn means that it gets the PV associated with the PVC.
- Launch another Pod with the same PVC which will demonstrate that these Pods can share the same RWX volume.
Here are the manifests. They are similar to the block manifests in many ways. However there is a new optional parameter in the StorageClass so that the file system type (fstype) can be specified.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsan-file-sc
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "RAID1"
  csi.storage.k8s.io/fstype: nfs4
[Update] In an earlier version of this post (pre-GA), I showed parameters such as allowroot, permission, and ips in the StorageClass manifest. In the GA version of the CSI driver, these parameters were moved to the vsphere.conf, the CSI driver configuration file. Click here to see examples of file share entries in the vsphere.conf. If you need to add new entries to the vsphere.conf, e.g. permissions or IP ranges for new volumes, you do not need to redeploy the CSI driver. Simply modify the vsphere.conf configuration with the new details and update the secret; the new configuration will be loaded after a short wait, determined by the kubelet sync period and cache propagation delay. For file volumes which are already provisioned, permissions cannot be modified.
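As a rough illustration of what such file share entries might look like, here is a sketch of the driver configuration wrapped in a Kubernetes Secret. Everything in it is an assumption to be checked against the documentation for your driver version: the Secret name and namespace, the csi-vsphere.conf key, the NetPermissions section with its ips, permissions and rootsquash keys, and all of the vCenter details, which are placeholders.

```yaml
# Sketch only: names, keys and values below are assumptions or placeholders,
# not taken from this post. Verify them against your CSI driver version.
apiVersion: v1
kind: Secret
metadata:
  name: vsphere-config-secret   # assumed Secret name
  namespace: kube-system        # assumed namespace
stringData:
  # The whole driver configuration file is stored under a single key.
  # The NetPermissions section is where the client IP range and the
  # access level for newly provisioned file shares would be set.
  csi-vsphere.conf: |
    [Global]
    cluster-id = "my-k8s-cluster"

    [NetPermissions "A"]
    ips = "10.27.51.0/24"
    permissions = "READ_WRITE"
    rootsquash = false

    [VirtualCenter "vcsa.example.com"]
    user = "administrator@vsphere.local"
    password = "changeme"
    datacenters = "Datacenter"
```

Keeping the file share settings in the Secret rather than in the StorageClass means they can be changed without touching the StorageClass or redeploying the driver, in line with the update note above.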
The PersistentVolumeClaim manifest is almost identical to the previous block example. The major difference is that the accessMode is now set to ReadWriteMany whereas previously it was set to ReadWriteOnce.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: file-pvc
spec:
  storageClassName: vsan-file-sc
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi
The Pod manifests are also very similar to before. The idea here is to demonstrate that both Pods can attach, mount and write to the same ReadWriteMany shared volume simultaneously.
apiVersion: v1
kind: Pod
metadata:
  name: file-pod-a
spec:
  containers:
  - name: file-pod-a
    image: "k8s.gcr.io/busybox"
    volumeMounts:
    - name: file-vol
      mountPath: "/mnt/volume1"
    command: [ "sleep", "1000000" ]
  volumes:
    - name: file-vol
      persistentVolumeClaim:
        claimName: file-pvc
This is the second Pod which will try to share the RWX persistent volume with the first Pod.
apiVersion: v1
kind: Pod
metadata:
  name: file-pod-b
spec:
  containers:
  - name: file-pod-b
    image: "k8s.gcr.io/busybox"
    volumeMounts:
    - name: file-vol
      mountPath: "/mnt/volume1"
    command: [ "sleep", "1000000" ]
  volumes:
    - name: file-vol
      persistentVolumeClaim:
        claimName: file-pvc
Let’s begin the same way as we did for block, first creating the storage class, then the PVC. We will then see what changes have occurred on the vSphere client UI. I won’t remove the block objects created previously so that you can compare them to the objects created for file.
$ kubectl apply -f file-sc.yaml
storageclass.storage.k8s.io/vsan-file-sc created
$ kubectl get sc
NAME            PROVISIONER              AGE
vsan-block-sc   csi.vsphere.vmware.com   76m
vsan-file-sc    csi.vsphere.vmware.com   5s
$ kubectl apply -f file-pvc.yaml
persistentvolumeclaim/file-pvc created
$ kubectl get pvc
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
block-pvc   Bound    pvc-6398f749-f93e-4dfd-8121-cf91926d642e   2Gi        RWO            vsan-block-sc   74m
file-pvc    Bound    pvc-216e6403-fd0c-48ea-bd05-c245a54d72ac   2Gi        RWX            vsan-file-sc    25s
$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM               STORAGECLASS    REASON   AGE
pvc-216e6403-fd0c-48ea-bd05-c245a54d72ac   2Gi        RWX            Delete           Bound    default/file-pvc    vsan-file-sc             17s
pvc-6398f749-f93e-4dfd-8121-cf91926d642e   2Gi        RWO            Delete           Bound    default/block-pvc   vsan-block-sc            74m
Note the access mode on the PVC and PV. The new volume is RWX, meaning ReadWriteMany. Let’s see what changes have occurred in the vSphere client after running the above commands. If we navigate to Configure > vSAN > File Service Shares, we observe that a new dynamically created file share now exists, of type Container Volume.
And if we click on the name of the volume, we once again get taken to the Monitor > vSAN > Virtual Objects view where we can see the object placement details, just like we could for the block volume previously. The container volume also appears in the Container Volumes view.
Let’s now deploy the first of our Pods and see what we get. Notice that the second block Pod is still ‘ContainerCreating’ from earlier, and will remain in that state indefinitely.
$ kubectl apply -f file-pod-a.yaml
pod/file-pod-a created
$ kubectl get pod
NAME          READY   STATUS              RESTARTS   AGE
block-pod-a   1/1     Running             0          68m
block-pod-b   0/1     ContainerCreating   0          52m
file-pod-a    1/1     Running             0          22s
$ kubectl get event --field-selector involvedObject.name=file-pod-a
LAST SEEN   TYPE     REASON                   OBJECT           MESSAGE
51s         Normal   Scheduled                pod/file-pod-a   Successfully assigned default/file-pod-a to k8s-worker7-01
51s         Normal   SuccessfulAttachVolume   pod/file-pod-a   AttachVolume.Attach succeeded for volume "pvc-216e6403-fd0c-48ea-bd05-c245a54d72ac"
42s         Normal   Pulled                   pod/file-pod-a   Container image "gcr.io/google_containers/busybox:1.24" already present on machine
42s         Normal   Created                  pod/file-pod-a   Created container file-pod-a
42s         Normal   Started                  pod/file-pod-a   Started container file-pod-a
The Pod was created and the volume was attached successfully, as per the events above. OK – let’s do something simple on the volume to show that we can read and write to it.
$ kubectl exec -it file-pod-a -- /bin/sh
/ # mount | grep /mnt/volume1
10.27.51.31:/52890fc4-b24d-e185-f33c-638eabfa5e25 on /mnt/volume1 type nfs4 \
(rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,\
timeo=600,retrans=2,sec=sys,clientaddr=192.168.232.1,local_lock=none,addr=10.27.51.31)
/ # df /mnt/volume1
Filesystem           1K-blocks      Used Available Use% Mounted on
10.27.51.31:/52890fc4-b24d-e185-f33c-638eabfa5e25
                     4554505216         0 4554402816   0% /mnt/volume1
/ # cd /mnt/volume1
/mnt/volume1 # mkdir CreatedByPodA
/mnt/volume1 # cd CreatedByPodA
/mnt/volume1/CreatedByPodA # echo "Pod A was here" >> sharedfile
/mnt/volume1/CreatedByPodA # cat sharedfile
Pod A was here
The next step is to launch our second Pod, and make sure it can share access and mount the same volume.
$ kubectl get pods
NAME          READY   STATUS              RESTARTS   AGE
block-pod-a   1/1     Running             0          116m
block-pod-b   0/1     ContainerCreating   0          100m
file-pod-a    1/1     Running             0          9m11s
$ kubectl apply -f file-pod-b.yaml
pod/file-pod-b created
$ kubectl get pods
NAME          READY   STATUS              RESTARTS   AGE
block-pod-a   1/1     Running             0          116m
block-pod-b   0/1     ContainerCreating   0          100m
file-pod-a    1/1     Running             0          9m29s
file-pod-b    1/1     Running             0          6s
$ kubectl get event --field-selector involvedObject.name=file-pod-b
LAST SEEN   TYPE     REASON                   OBJECT           MESSAGE
27s         Normal   Scheduled                pod/file-pod-b   Successfully assigned default/file-pod-b to k8s-worker7-02
27s         Normal   SuccessfulAttachVolume   pod/file-pod-b   AttachVolume.Attach succeeded for volume "pvc-216e6403-fd0c-48ea-bd05-c245a54d72ac"
19s         Normal   Pulled                   pod/file-pod-b   Container image "gcr.io/google_containers/busybox:1.24" already present on machine
18s         Normal   Created                  pod/file-pod-b   Created container file-pod-b
17s         Normal   Started                  pod/file-pod-b   Started container file-pod-b
The first thing to notice are the events. These all look good and it would appear that the second Pod, Pod B, has been able to successfully mount the RWX Persistent Volume, even when it is already attached and mounted to Pod A. Excellent! The final step is to log into Pod B to check if we can also read and write to the volume.
$ kubectl exec -it file-pod-b -- /bin/sh
/ # mount | grep /mnt/volume1
10.27.51.31:/52890fc4-b24d-e185-f33c-638eabfa5e25 on /mnt/volume1 type nfs4 \
(rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,\
timeo=600,retrans=2,sec=sys,clientaddr=192.168.213.193,local_lock=none,addr=10.27.51.31)
/ # df | grep /mnt/volume1
                     4554263552         0 4554161152   0% /mnt/volume1
/ # df /mnt/volume1
Filesystem           1K-blocks      Used Available Use% Mounted on
10.27.51.31:/52890fc4-b24d-e185-f33c-638eabfa5e25
                     4554263552         0 4554161152   0% /mnt/volume1
/ # cd /mnt/volume1
/mnt/volume1 # ls
CreatedByPodA
/mnt/volume1 # cd CreatedByPodA/
/mnt/volume1/CreatedByPodA # ls
sharedfile
/mnt/volume1/CreatedByPodA # cat sharedfile
Pod A was here
/mnt/volume1/CreatedByPodA # echo "Pod B was here too" >> sharedfile
So we can see the directory and file that were created on the share from Pod A. We just did a simple update to the file originally created by Pod A from Pod B. If we now flip back to the shell session we have open on Pod A, let’s see if we are able to see the update to the sharedfile.
/mnt/volume1/CreatedByPodA # cat sharedfile
Pod A was here
Pod B was here too
Looks like both Pods are successfully able to read and write to this RWX persistent volume, dynamically provisioned by vSAN File Services.
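The same claim could equally be consumed by a Deployment, where every replica mounts the shared volume. A minimal sketch, assuming the file-pvc claim created above; the Deployment name file-deploy and its app label are hypothetical and not part of the demo manifests.

```yaml
# Hypothetical sketch: file-deploy is not part of the original demo.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: file-deploy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: file-deploy
  template:
    metadata:
      labels:
        app: file-deploy
    spec:
      containers:
      - name: busybox
        image: "k8s.gcr.io/busybox"
        command: [ "sleep", "1000000" ]
        volumeMounts:
        - name: file-vol
          mountPath: "/mnt/volume1"
      volumes:
      - name: file-vol
        persistentVolumeClaim:
          # All three replicas reference the same RWX claim.
          claimName: file-pvc
```

Because the claim is ReadWriteMany, the replicas can be scheduled on different worker nodes and still mount the share. With the RWO block-pvc, any replica landing on a second node would be expected to hit the same Multi-Attach error seen earlier.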
I hope this gives you a good idea about how vSAN File Services can be used for both traditional virtual machine workloads as well as newer containerized workloads. We saw how file shares on vSAN can be dynamically provisioned as persistent volumes, along with a storage class that reflects the desired availability and performance of the volume through storage policies. We also saw some neat UI enhancements, although I haven’t shown them all in this post. The main take-away is that it doesn’t matter if a developer is using block-based RWO volumes or file-based RWX volumes provisioned from vSAN: the vSphere administrator has full visibility into how the developer is consuming vSphere storage. This allows good communication to develop between a vSphere Administrator and the Kubernetes persona, whether that is a developer or a K8s admin. Either way, this is enabling a culture of Dev-Ops to happen in the organization.
If you want to learn more about vSAN File Service and the new CSI driver for file shares, check out this blog post from my good pal Myles. It includes a nice demo of how vSAN File Services works with our new CSI driver and integrates with CNS.
The manifests used for this demo are available on this public GitHub repo.
Any plans for RWX support on VMFS datastores using vSphere CSI ?
Something we are considering, but it would be extremely useful if you could provide more context Jean-Philippe.
When or why would you use RWX block instead of RWX file?
What is your use case? Which applications, etc?
Thank you.
Hi Cormac,
Sorry, I did not explain myself clearly. I was referring to regular vSphere ESXi hosts that are connected to central storage and not using vSAN. Right now the vSphere cloud provider does not permit read write many when you create a PV using a storage class based on it.
Let’s say you deploy an app on your vsphere k8s cluster and you need replicas and persistent storage, in your deployment yaml file, you would set lets say 3 replicas and read write many to one PV.
I know best practices dictates doing a stateful set. Would that create a PV directly attached to each node where you have a replica ?
Hi Jean-Philippe – the vSphere Cloud Provider (in-tree VCP) only provisions block read-write-once. It does not do read-write-many, neither on block nor on file.
In the manifest that you describe above you still would not be able to have multiple pods writing to the same PV, even if it was created as part of a deployment.
You could perhaps use a Pod based NFS server, which has its PV on vSphere storage, and do RWX that way. I wrote about it here – https://cormachogan.com/2019/06/20/kubernetes-storage-on-vsphere-101-readwritemany-nfs/ – not sure if it meets your needs.
Thanks Cormac,
Yes I am aware the VCP does only RWO. In your article you said VMware now supports the dynamic provisioning of file shares on vSAN 7.0 with the File Service feature enabled.
If I summarize my question, it would be : how about vSphere connected to regular central storage ? Any plans from VMware to have any RWX from vSphere CSI 2.0 eventually outside of vSAN ?
I got the compatibility chart from one of your posts : https://cormachogan.com/2020/05/07/vsphere-csi-driver-versions-and-capabilities/ where it says (vsan only)
Thank you for your time
Ah – got it. Yes, it is something that is being considered. What might be useful is if you could share your use-case with me Jean-Philippe. What is the application that you have that requires block RWX? Many thanks.