I’m sure most readers will be somewhat familiar with VMware’s Project Pacific at this point. It really is the buzz of VMworld 2019. If I had to describe Project Pacific in as few words as possible, it is a merging of vSphere and Kubernetes (K8s) with the goal of enabling our customers to deploy new, next-gen, distributed, modern applications which may be comprised of container workloads or combined container and virtual machine workloads. Not only that, we also need to provide our customers with a consistent way of managing, monitoring and securing these new modern applications. This is where Project Pacific comes in, closing that gap between IT operations and developers.
Another interesting Project Pacific description I heard is that it is changing the consumption model of vSphere – we are now allowing vSphere resources to be consumed through Kubernetes. This will become clearer as we go through the post.
I published a short write-up on Project Pacific (and Tanzu) after the initial announcements from VMworld 2019 in the US. However, VMworld 2019 in Barcelona last month had a whole bunch of deep dive sessions on Project Pacific. There were deep dives on the Supervisor Cluster, Native Pods and Guest Clusters. Now that we are openly discussing the inner workings of this ground-breaking initiative, I thought I would try to describe some of it here. Personally, when I try to understand something, the best way I find is to try to describe it to someone else. Hopefully those of you who are also trying to get up to speed on Project Pacific will find this write-up useful. Let’s start with an overview of the Supervisor Cluster.
I will assume that most readers will understand the concept of a vSphere Cluster, which is essentially a group of ESXi hosts for running virtual machine workloads. While Project Pacific continues to allow us to run traditional VM workloads on a vSphere cluster, it also extends the cluster to allow container workloads to run natively on it. In other words, Project Pacific is making a vSphere Cluster the same as a Kubernetes Cluster. Well, that statement is maybe not 100% accurate, but we can certainly think of a Supervisor Cluster as a group of ESXi hosts running virtual machine workloads, while at the same time acting as Kubernetes worker nodes and running container workloads. This is all configured and managed through the vSphere client by a vSphere administrator when Project Pacific is enabled. So how is it deployed? Let’s look at that next:
Supervisor Cluster Deployment
vCenter Server will include a number of new components to enable Project Pacific, and the creation of a Supervisor Cluster. The first of these is the Supervisor Control Plane Image, which will allow us to deploy the Kubernetes master nodes as a set of VMs on the ESXi hosts. These will be an odd number of VMs and will also maintain etcd, the KV (key-value) store that is essentially the Kubernetes database. These are also deployed with anti-affinity rules to ensure that any failure or maintenance on the cluster does not bring down the control plane. These control plane VMs, as well as running the traditional Kubernetes API server, also have a bunch of vSphere extensions around networking (NCP – NSX-T Container Plugin), storage (CNS), scheduling and authentication, the latter enabling users to login to K8s Guest Clusters using their vSphere SSO credentials.
Another significant part is the Spherelet Bundle, which is the component that allows an ESXi host to behave as a Kubernetes worker node. The Workload Platform Service on vCenter installs a spherelet daemon on every ESXi host in the cluster when Project Pacific is enabled. It also exposes a REST API that enables the management of namespaces, which we will read about shortly.
Finally, there is the Token Exchange Service. This relates to authentication and allows us to take vSphere SSO Service SAML credentials and convert them to JSON Web Tokens (JWT) for use with systems like Kubernetes, allowing a vSphere user to login to a Kubernetes Guest Cluster. The login mechanism utilizes K8s RBAC. This means that you can also create custom K8s Role Bindings for users who are not part of the vCenter SSO.
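To make the RBAC point a little more concrete, here is a sketch of the sort of Role Binding you could create for a user who is not part of vCenter SSO. This uses the standard Kubernetes RBAC API; the user, namespace and binding names are made up for illustration:

```yaml
# Illustrative only: grant a non-SSO user edit rights in one namespace
# using standard Kubernetes RBAC. All names here are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jane-edit
  namespace: demo-app            # namespace being shared with the user
subjects:
- kind: User
  name: jane                     # hypothetical user authenticated outside vCenter SSO
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                     # built-in Kubernetes ClusterRole
  apiGroup: rbac.authorization.k8s.io
```

Once applied with `kubectl apply`, this user can create and modify most resources in the `demo-app` namespace, but nothing cluster-wide.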
This slide, taken from the Supervisor Cluster Deep Dive session at VMworld (HBI1452BE) by Jared Rosoff and Mark Johnson shows the Project Pacific components that we plan to include in vCenter Server.
We introduced a few critical concepts in that last section, such as namespace and spherelet. Let’s now delve deeper into those as that will help us understand more about the Supervisor Cluster architecture.
If you have a Kubernetes background, the first thing to mention is that namespaces in a Supervisor Cluster are very different to namespaces in Kubernetes. Namespaces in the context of a Supervisor Cluster can be thought of simply as Resource Pools to isolate a set of CPU and Memory resources for a given application or indeed user. The vSphere admin/Supervisor Cluster admin can also set up and modify attributes of the namespace to determine who can access it. An admin can also configure which storage policies are visible to the namespace, which in turn determines which datastores the namespace has access to.
Basically we can have namespaces created for the running of traditional VMs, running of Native Pods, a combination of both or even the deployment of a complete upstream Kubernetes Guest Cluster.
Namespaces are how we are achieving multi-tenancy and isolation in Project Pacific.
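The closest analogue in plain Kubernetes terms would be a namespace with a resource quota attached. As a rough illustration (not the Project Pacific implementation, which maps namespaces to vSphere Resource Pools), this is how a CPU/memory boundary is expressed with the standard Kubernetes API; the names and numbers are made up:

```yaml
# Analogy only: in vanilla Kubernetes, per-namespace CPU and memory
# isolation is expressed as a ResourceQuota. Names/values are hypothetical.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "8"            # total CPU the namespace may request
    requests.memory: 32Gi
    limits.cpu: "16"             # hard ceiling across all Pods
    limits.memory: 64Gi
```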
Now we’ve mentioned the fact that ESXi hosts can now work as Kubernetes worker nodes. How exactly do we do that? Well, in vanilla Kubernetes, the worker nodes run a Kubernetes agent called the kubelet. It manages Pods and the containers in those Pods. The kubelet watches the API server on the control plane/master for new Pods, and takes care of configuring and mounting external storage to the node on behalf of containers. It also configures networking for the Pods and nodes, creating pseudo NICs and putting in place all of the bridging required by a container.
Once the kubelet determines that the node is fully configured and is ready to run containers, it informs the container run-time to say that it can now start containers. The most popular container run-time in Kubernetes today happens to be docker.
The kubelet continuously probes Pod and Container health, monitors the communication endpoints, and kills and restarts them when necessary – more or less taking care of the life-cycle management of the Pods and Containers on that node. It is also a controller, in that it constantly polls the API server to see if there are new Pods that have been scheduled on its node.
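To make that probing behaviour concrete, here is a minimal standard Kubernetes Pod spec with a liveness probe. The image and endpoint are just placeholders; the point is that the agent (kubelet, or spherelet agent in the Native Pod case described below) restarts the container if the probe fails:

```yaml
# Minimal example of the health probing described above.
# Image and probe endpoint are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: web
    image: nginx:1.17            # placeholder container image
    livenessProbe:               # container is restarted if this check fails
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5     # wait before the first probe
      periodSeconds: 10          # probe every 10 seconds thereafter
```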
To make an ESXi host into a K8s node, we introduce the concept of a spherelet. The spherelet runs on the ESXi host and just like the kubelet, monitors the API server for new Native Pod requests. However we had to move some of the functionality that was found in a kubelet into separate controllers. For networking, we have a controller in NCP (NSX-T Container Plugin) that monitors the API server and applies any changes made to the networking configuration. Similarly, there is a controller in vCenter that monitors the API server for changes in the storage configuration, and applies any changes made to the storage.
The spherelet is only responsible for the Native Pods on that ESXi host/node. I realize we have not yet introduced the concept of Native Pods – we will come to this directly, but suffice to say that Native Pods are like K8s Pods in that they group containers into a single unit of management. But there are some differences, which we will cover in a moment. Once the spherelet on the ESXi host has determined that networking and storage have been configured correctly, and that the Native Pod is now ready to run containers, it passes control over to the spherelet agent which runs inside of the Native Pod. This takes care of the container disk mounts, any network bridging and running the containers. Once a container is launched, the responsibility for the container stays with the spherelet agent in the Native Pod – it is no longer the responsibility of the spherelet on the node. The spherelet agent will issue all of the probes necessary for container health, provide the interactive endpoint for container communication (shell access, retrieve logs) and also decide whether or not a container should be killed or restarted, and so on.
Now, when we discussed the spherelet and the spherelet agent in the last section, we already covered quite a bit on the concept of the Native Pod. Since the ESXi host is now a Kubernetes node, it has the ability to run Pods – what we are calling Native Pods. The Native Pod can be thought of as a virtual machine, but a very light-weight, optimized, highly opinionated and fast booting virtual machine which in turn can run containers.
However, at its core, a Native Pod is really just a flavor of what we call a CRX instance. A CRX instance is a very special form of VM which provides a Linux Application Binary Interface (ABI) through a very isolated environment. Let’s take a closer look at CRX next.
CRX – Container Run-time for ESXi
As I just said, the CRX instance is a very special form of VM. When the CRX instance starts, it also launches a single controlling process. This is what makes it very similar to containers and why we called it the Container Run-time for ESXi.
VMware provides the Linux Kernel image used by CRX instances. It is packaged with ESXi and maintained by VMware. When a CRX instance is brought up, ESXi will push the Linux image directly into the CRX instance.
CRX instances have a CRX init process which is responsible for providing the endpoint to allow communication with the ESXi, and allow the environment running inside of the CRX instance to be managed. This is shown in the diagram on the right here.
CRX instances have 2 modes, and the spherelet is responsible for injecting the personality of the CRX instance when it starts up. The first, managed mode, is what a Native Pod is: the instance is visible to vCenter Server as a virtual machine, so its resource usage can be examined, its IP address checked, etc. An un-managed CRX instance is treated as an application or daemon, and is not visible to vCenter Server. The image service in Project Pacific, which provides images to the containers running in a Native Pod, is an example of the latter. It only exists for the duration of downloading and extracting an image; its job is to download and extract images to a location where they can be mounted by Pods.
The personalities for CRX instances are provided by ESXi VIBs and are injected into the CRX Instance. For Native Pods, the spherelet agent and the OCI compatible run-time, libcontainer are added to the CRX instance. The spherelet agent is linked to the run-time and is thus able to manage containers in the Native Pod. This is what is shown on the left hand side here.
Before we leave Native Pods, here is one last screenshot that I borrowed from the Native Pods Deep Dive session at VMworld 2019 (HBI4501BE) by Adrian Drzewiecki and Benjamin Corrie. It gives a good idea of the various moving parts around a Native Pod: the role of the spherelet and the spherelet agent, and how storage and networking are handled by controllers outside of the spherelet, which is the major difference when compared to a kubelet. The CRX instance and Native Pod diagrams above came from the same session.
One final note on CRX and Native Pods, and this is to do with vMotion. K8s has no concept of moving a Pod from one node to another in an orchestrated fashion. If a K8s node fails, the Pods get rescheduled on another node elsewhere in the cluster. Users of Native Pods need to understand that their applications need to be deployed as highly available, either as Deployments/ReplicaSets or as StatefulSets (the latter if Persistent Storage is required for the Pods). Native Pods do understand Maintenance Mode however, and the Native Pods will be shut down on the host entering Maintenance Mode, meaning the K8s scheduler will restart them elsewhere in the cluster. If you don’t want your application to go offline during this time, you need to make the application highly available. The same is true for applications deployed natively on vanilla Kubernetes clusters.
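For example, a standard Kubernetes Deployment with three replicas keeps the application available while any single Pod is rescheduled during host maintenance. The names and image below are placeholders:

```yaml
# Standard Kubernetes Deployment illustrating the HA advice above.
# Names and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                    # survives the loss/rescheduling of any one Pod
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.17        # placeholder container image
```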
OK – at this point, we can leave the Supervisor Cluster, Native Pods and CRX. Let’s now turn our attention to one of the other major features of Project Pacific, and that is the ability to deploy upstream Kubernetes Guest Clusters in a namespace.
The final component of Project Pacific that we have not yet covered is the concept of a Guest Cluster. This basically means we can take a namespace and deploy an upstream Kubernetes cluster, a Guest Cluster. This means that as an admin you can isolate a bunch of resources through namespaces for this cluster, assign storage policies so that the developers can create persistent volumes, choose who has access to the cluster and provide this environment to your developer(s).
Now there is very much a layered approach to how Guest Clusters are deployed in a namespace in Project Pacific. We have introduced a number of operators to simplify the deployment as much as possible. And while there are many ways to deploy the cluster, the aim here was to make it as simple as possible and to hide the complexity of some of the lower layers.
At the top-most layer, we have the concept of a Guest Cluster Manager. With a simple YAML manifest file, we can request the creation of a Guest Cluster with X number of control plane/master nodes and Y number of worker nodes. We can also specify which distribution of Kubernetes we wish to have deployed and alongside some other basic information about networking, we can run a simple kubectl apply and begin the deployment of the cluster.
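To give a feel for what such a manifest might look like, here is a sketch. To be clear, the `kind`, API group and field names below are my own assumptions to show the shape of the request, not the actual Project Pacific API, which was not public at the time of writing:

```yaml
# Sketch only: kind, apiVersion and field names are assumptions,
# not the real Project Pacific Guest Cluster Manager API.
apiVersion: example.vmware.com/v1alpha1
kind: GuestCluster
metadata:
  name: dev-cluster
  namespace: team-a              # Supervisor Cluster namespace it lives in
spec:
  distribution:
    version: v1.15.4             # desired Kubernetes distribution
  topology:
    controlPlane:
      count: 3                   # X control plane/master nodes
    workers:
      count: 5                   # Y worker nodes
```

A manifest along these lines would then be submitted with a simple `kubectl apply -f guest-cluster.yaml`.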
The Guest Cluster Manager produces specifications that can be consumed by Cluster API, and this information is now passed to the Cluster API Controllers. Cluster API is a relatively new community-driven project to bootstrap Kubernetes clusters using Kubernetes.
The final stage is that the Cluster API Controllers provide a set of VM resources that are consumed by the VM Operator, which interfaces with the vSphere Control Plane to achieve the desired state of our Guest Cluster, e.g. create the specified number and type of master nodes, and create the specified number and type of worker nodes.
This screenshot, taken from the Guest Cluster Deep Dive session (HBI4500BE) by Derek Beard and Zach Shepherd, gives us a pretty good idea of how all these layers tie together.
Now, my understanding is that there is nothing to stop you interacting at the Cluster API layer if you need more granularity than what is offered at the Guest Cluster Manager layer. And no doubt, if you knew what you were doing, you could also work at the VM Operator layer to deploy the virtual machines for the control plane and worker nodes that your cluster requires, but I’m not sure why you would need to work at this level. Again, the whole purpose of this 3-layer approach is to simplify the deployment of Guest Clusters whilst hiding the complexity of the lower layers.
Now that the cluster has been deployed, we have included a plugin to K8s that will allow developers to login. Here you can see it being extracted and used. When a developer logs in, they get visibility into the resources (namespaces) available to them, as shown in this screenshot taken from breakout session HBI1452BE:
It should be noted that someone with ‘edit’ access on a namespace created on the Supervisor cluster automatically gets cluster admin privileges on a Guest Cluster provisioned in that namespace – this means they have full control over K8s cluster RBAC after that.
One question I noticed that came up time and again in the VMworld deep dive sessions was around upgrades, both for the Supervisor cluster and the Guest cluster. The Guest Clusters are easy – you just download the new distribution to your content library, specify a new version of the K8s distribution in the Guest Cluster Manager YAML manifest and apply it. Once the image is available, it will run through a rolling upgrade of your Guest Cluster.
For the Supervisor Cluster, we will release patches with vSphere and you simply apply them through VUM (VMware Update Manager) which will also go through a rolling upgrade mechanism of your ESXi hosts.
From a networking perspective, NSX is used at the Supervisor Cluster layer. There is an L4 Load Balancer built on the NSX Edge, fronting the control plane so that if one master goes down, it doesn’t impact access. It also means that if there are any maintenance tasks or upgrades going on in the system, once again access to the control plane is not impacted. There is also a Distributed Load Balancer across all hosts to act as a kube-proxy for Pod-to-Pod east-west traffic. Here is an excellent diagram on Supervisor Cluster networking taken from the HBI1452BE session:
Now I haven’t included anything regarding what the Guest Clusters are using for networking. This was not covered in the deep dive that I could see, so it may well be something that is still being worked on.
As you can see, this is a major undertaking for VMware, with a huge development effort underway to bring Project Pacific to market. And as you can imagine, there is a huge amount of interest, as this will enable a single platform to run traditional virtual machine workloads, new container based applications, or indeed new, distributed, modern applications that could use a combination of the two together. Also, the concept of a namespace creates a new resource management entity which could hold both VMs and Pods. So now we have a way for developers to consume vSphere resources for their modern applications, while continuing to give infrastructure administrators full visibility into how those resources are being consumed. Exciting times!
While I have taken snippets from multiple deep dives delivered at VMworld this year, there is nothing like the real thing. Click on the links above and watch the recordings – they’ll provide you with far more detail than what I have written here.