After reaching out, I was given a briefing last week by Michael Ferranti of Portworx. I started off by asking if this solution could work with vSphere. Michael stated that it absolutely does, that they already have customers running this solution on vSphere, and that other customers are evaluating it on vSphere. Their product is a pure software product, installed as a set of containers, which takes the host’s storage, aggregates it and makes it available to applications in containers. Now, a host in this context is a container host, where the docker daemon runs. So for physical deployments, this host would be some Linux distribution running directly on bare-metal. For virtualized environments, it would be a VM running a Linux Guest OS. I clarified with Michael how this would look on vSphere. Essentially, you would have a number of ESXi hosts with some storage (SAN, NAS, local, vSAN), and then create one or more VMs per ESXi host (running a Linux Guest OS), with the physical storage presented to the VMs as VMDKs, or virtual machine disks. In each VM, one would then run the docker daemon and deploy Portworx. Portworx would be deployed in the VMs (let’s now refer to them as container hosts), starting with the first container host. Further installs of Portworx “nodes” would then take place in the other container hosts, forming a much larger cluster. As storage is discovered on the container hosts, it is “virtualized” by Portworx, aggregated or pooled together, and made available to all of the container hosts. Portworx requires a minimum of 3 container hosts, which implies that if you want host-level availability on vSphere, one would also require a minimum of 3 ESXi hosts, one for each container host. However, it was also made clear that as well as storage nodes, Portworx also supports the concept of storage-less (or compute-only) nodes.
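To make that layout a little more concrete, here is a minimal sketch of what you would see inside one of the container host VMs once a couple of extra VMDKs have been attached. The device names are purely illustrative and will vary per environment; these are the devices that would later be handed to Portworx.

# Inside a container host VM on ESXi, list the block devices (disks only).
# The Guest OS disk plus any additional VMDKs show up as /dev/sdX devices,
# e.g. /dev/sdb and /dev/sdc (names illustrative).
lsblk -d -o NAME,SIZE,TYPE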
The installation of Portworx is done with a simple “docker run” command (with some options), and storage is identified by providing an option pointing to one or more storage devices (e.g. /dev/sdX). As each drive is added to the aggregate, it is benchmarked and identified as a high, medium or low class of service. Once the Portworx cluster is formed, docker volumes can then be created against the aggregate, or you can use Portworx’s own CLI called pxctl. The sorts of things one would specify on the command line are maximum size, IOPS requirement, the file system type (e.g. EXT4 or XFS) and the availability level, which is essentially how many copies of the volume to keep in the cluster. Of course, the availability level will depend on the number of container hosts available. Portworx also have a UI called Lighthouse for management and monitoring. The file system type could in theory be any file system supported by the OS, but for the moment Portworx are supporting EXT4 and XFS. Michael also said that the user can create a “raw” (unformatted) volume, and then format it with any filesystem supported by the Linux distro running in the container host.
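As a rough illustration only (the exact image name, flags and required host bind-mounts change between Portworx releases, and the etcd endpoint, cluster name, device names and volume name below are placeholders), the install and a subsequent volume creation might look something like this:

# Launch the Portworx container on a container host.
# -k points at a key-value store, -c names the cluster, -s lists storage devices.
# The full command in the Portworx docs also bind-mounts several host paths; omitted here for brevity.
docker run --restart=always --name px -d --net=host --privileged=true \
  portworx/px-dev -k etcd://<etcd-host>:2379 -c my-px-cluster -s /dev/sdb -s /dev/sdc

# Create a 10 GB volume with 3 replicas and an ext4 filesystem using pxctl
# (flag names approximate; consult the Portworx docs for exact syntax).
/opt/pwx/bin/pxctl volume create myvol --size 10 --repl 3 --fs ext4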
To make the volumes available to all container hosts, Portworx has its own docker volume plugin. What you see on the container hosts are Portworx virtual volumes. This means that should one of an application’s containers fail and restart on another host in the cluster, the same volume is visible on this new container host. An interesting aspect of this is data locality. Portworx always try to keep an application container and its volumes on the same container host. Portworx place labels on hosts so that the scheduler places the container on the correct host. There is no notion of moving the data to a particular container host, or moving the application container to the data; this is left up to the scheduler. Of course, this isn’t always possible, so there is a notion of a shared volume which can be mounted on multiple container hosts, with NFS-style semantics (but it is not NFS).
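For example, volumes can also be created through the docker volume plugin (the pxd driver) and then mounted into a container by name. The volume names and option values here are illustrative, and option spellings may differ by version:

# Create a Portworx volume via the docker volume plugin (driver name pxd).
docker volume create -d pxd --opt size=10 --opt repl=3 pxvol
# Run a container that mounts the volume; if the container restarts on another
# host in the cluster, the same volume name resolves there too.
docker run -it -v pxvol:/data busybox sh
# A shared volume (NFS-like semantics, but not NFS) can be requested with an option.
docker volume create -d pxd --opt shared=true sharedvol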
One question I had is how do you determine availability when multiple container hosts are deployed on the same hypervisor? There are ways of doing this, tagging being one of them. Portworx can also figure out rack awareness; how they do this depends on whether the deployment is in the cloud or on-premises. Michael used Cassandra as an example. In the cloud, Portworx automatically handles placement by querying a REST API endpoint on the cloud provider for the zone and region information associated with each host. Then, with this understanding of your datacenter topology, Portworx can automatically place your partitions across fault domains, providing extra resilience against failures even before employing replication. Therefore, when deployed on something like AWS with EBS storage, Portworx can determine which availability zone each node is deployed in. With on-prem deployments, Portworx can also influence scheduling decisions by reading the topology defined in a file like cassandra-topology.properties.
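To illustrate the cloud case, the zone and region lookup is the kind of query any instance can make against its provider’s metadata service. On AWS EC2, for example (this is standard EC2 behaviour, not a Portworx-specific API):

# Query the EC2 instance metadata service for this node's availability zone.
# Example result would be something like us-east-1a.
curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone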
I asked about Data Services next. Portworx can of course leverage the data services of the underlying physical storage, but they have a few of their own as well. All volumes are thin provisioned. They have snapshot capabilities which can be scheduled to take full container volume snapshots, or simply snapshot the differences for backup. If I understood correctly, these snapshots can be redirected to an S3 store, even in a public cloud; Portworx call this CloudSnap. They can also grow volumes on-the-fly, and the container consuming the volume will automatically see the new size. There is also an encryption feature, and Portworx supports leading KMS systems like HashiCorp Vault, AWS KMS and Docker Secrets, so different users can encrypt data with their own unique key. This applies to data in flight as well as data at rest. The encryption is container-centric, meaning that three different containers would have three different keys. Michael also mentioned that they do have deduplication, but this is not applied to the primary data copy; however, it is applied to the replica copies. Finally, there is a cloning feature, and volumes can be cloned to a new read-write volume or a new read-only volume.
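A few of these operations map onto pxctl sub-commands. The commands below are approximations based on the public documentation rather than verified syntax, and the volume and snapshot names are placeholders:

# Take a snapshot of a volume (command structure approximate).
pxctl volume snapshot create --name myvol-snap1 myvol
# Grow a volume on the fly; the consuming container sees the new size automatically.
pxctl volume update --size 20 myvol
# Back a volume up to an S3-compatible object store (CloudSnap);
# cloud credentials need to be configured beforehand.
pxctl cloudsnap backup myvol
# Clone a volume to a new read-write volume.
pxctl volume clone --name myvol-clone myvol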
So now that you have Portworx running, what are the next steps? Well, typically you would now run a scheduler/clustering framework on top, such as Docker Swarm, Kubernetes or Mesosphere. What makes it interesting is that now that Portworx has placed the different storage in “buckets” of low, medium and high, a feature like Kubernetes storage classes can be used to orchestrate the creation of volumes based on an application’s storage requirements. Now you can run different container applications on different storage classes. Since the underlying storage is abstracted away, the DevOps teams deploying applications do not have to worry about provisioning it; they simply request the class of storage they want. The storage required by applications within containers can then be provisioned alongside the application by the scheduler. This applies whether the deployment is in the cloud or on-prem.
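As a sketch of how that looks with Kubernetes, a storage class can request the “high” bucket via the Portworx provisioner. The repl, io_priority and fs parameters come from the Kubernetes documentation for the Portworx volume plugin; the class name and values are examples only:

# Define a storage class backed by Portworx, requesting the high class of service.
kubectl apply -f - <<'EOF'
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: px-high-io
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3"
  io_priority: "high"
  fs: "ext4"
EOF

A persistent volume claim referencing this storage class would then have its volume carved out of the high-performance pool, alongside the application that consumes it.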
From an operational perspective, if a drive on one of the container hosts fails, Portworx marks it as dead and, depending on the available resources and the availability factor, a new copy of the data is instantiated. The administrator would then either remediate or remove the bad drive; this workflow is done via the Portworx CLI. Similarly, if a Portworx container (responsible for providing the Portworx virtual devices on a container host) fails, the expectation is that the application, or more specifically the OS running the application in the container, will be able to hold off the I/O until a new Portworx container can be spun up. Under the covers, Portworx use Raft for cluster consensus, gossip for the control path (so nodes know what is going on in the cluster) and ZeroMQ for storage replication.
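For day-to-day checks of this sort, pxctl exposes cluster and drive status commands; the sub-command names below may differ slightly between versions:

# Check overall cluster health and the status of this node.
pxctl status
# List the nodes participating in the Portworx cluster.
pxctl cluster list
# Inspect the drives contributed by this node.
pxctl service drive show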
Portworx are betting on containers becoming mainstream. If this does happen, then these guys will be well positioned to provide a storage solution for containers. If you wish to try it out, pop over to https://docs.portworx.com/. This has instructions on how to install Portworx with all the popular schedulers. If I read the terms and conditions correctly, you can set up a 3-node cluster without a license. Thanks to Michael and Jake for taking time out of their schedule to give me this briefing.