My highlights from KubeCon and CloudNativeCon, Day #3, Europe 2019
Today is the final day of KubeCon/CloudNativeCon here in Barcelona. I missed the morning keynotes as I was meeting with some of our DELL Technologies colleagues based here in Barcelona. They are working on something that is very cool, and I hope I’ll be able to share more with you later this year.
Anyway, once at the event, these are some of the sessions I attended. I wanted to try and catch a few presentations that were not storage orientated today, simply to get a better idea of what is happening in the broader K8s community. The first of these sessions was “Let’s Try Every CRI Runtime Available for Kubernetes. No, Really!” by Phil Estes of IBM. A container runtime is the code that is responsible for running containers and managing container images on a Kubernetes worker node. The intro for this session talks about deciding which CRI implementation you would choose, as each has trade-offs. We got to see Phil demo the different runtimes. While there wasn’t really any deep discussion on which runtimes are best, my take-aways from this session are (1) some runtimes are simple to deploy, others are more complex, (2) some runtimes implement all syscalls but then consume more resources, whilst others do not implement all syscalls and consume fewer resources – therefore you need to understand the needs of your application and make sure your runtime can meet them, and (3) you may have some application tools that rely on certain runtime endpoints (e.g. docker), and these tools may not work if another runtime is used. The different runtimes that were shown were docker, containerd, CRI-O, Kata Containers, Firecracker, gVisor, Singularity, and Nabla.
The next session I attended was “Building a Controller Manager for your Cloud Platform”. This was presented by VMware’s own Fabio Rapposelli and Chris Hoge of OpenStack. Fabio started by explaining the need for Cloud Controller Managers (more commonly referred to as CCMs). He told us that there are currently 8 cloud providers built inside of K8s 1.14 – these cloud providers contain specific code for the different platforms on which K8s runs. He said that most of this is unnecessary code, as you are going to run your K8s on just one platform. He explained how this is hard to maintain for people who are not working on that particular platform, and that K8s should not be in the business of maintaining this. He also said that since the cloud providers are in-tree, it sort of implies that they are endorsed by K8s, which is not the case either. The biggest problem though, in my opinion, is that critical updates for these cloud providers can only be delivered with the K8s release cycle, so it is very difficult to patch issues quickly. So in order to support independent work, the CCM concept was created, which will run side-cars on the nodes. This will facilitate an “out of tree” provider mechanism and address the issues outlined here. Fabio finished by telling us that by the end of 2019, there will be no in-tree provider code in K8s. Chris then talked us through the steps of how to build your own controller manager, and what precautions you should take when doing so. Very good session, and for any in-tree cloud providers out there, worth a watch if you are planning your own CCM.
My final session of the day was “Latest K8s scaling improvements” by Yassine Tijani and Shyam Jeedigunta. Shyam started by saying that scalability is more than just the number of nodes in a cluster. He mentioned that the supported limit was set at 5,000 nodes in 2017. Shyam and Yassine told us how this large number of supported nodes has ramifications for other parts of the system. For example, since the kubelets on all nodes send periodic heartbeats every 10 seconds, the status that is being sent back has, on some occasions in very large deployments, filled the etcd datastore. The plan is to now separate out the heartbeat from the status to avoid this. We were also told about changes to the scheduler. This is predominantly to do with anti-affinity in large clusters, where lots of calculations have to be made to ensure placement rules do not get broken, i.e. a pod is placed on a node only if doing so does not break its own rules, or those of any other pod on the same node. These changes, if I understood correctly, will just try to identify a few suitable node candidates rather than all suitable node candidates in the cluster. This will bring down the time to schedule a pod with anti-affinity on very large clusters. One other improvement was around events. Again, issues have been observed with repeated events filling up etcd and overloading the API server. The plan is to create a new event object to avoid aggregation. Once again, if I understood correctly, we will now just increment a count associated with a repetitive event and track repetitive events on the client, without a need to send them to the API server or etcd. A very interesting session, and well worth watching.
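To make the scheduler and event ideas a bit more concrete for myself, here is a minimal Python sketch of how I understood them. To be clear, this is not Kubernetes code and not how the scheduler or the event API is actually implemented. The names find_feasible_nodes, EventCounter and the no_conflict check are all my own hypothetical inventions, and the feasibility test is a stand-in for real anti-affinity evaluation.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def find_feasible_nodes(
    nodes: Iterable[str],
    is_feasible: Callable[[str], bool],
    enough: int,
) -> List[str]:
    """Return up to `enough` feasible nodes, stopping the scan early.

    Hypothetical helper: rather than evaluating every node in the
    cluster, stop as soon as a handful of candidates has been found.
    """
    found: List[str] = []
    if enough <= 0:
        return found
    for node in nodes:
        if is_feasible(node):
            found.append(node)
            if len(found) >= enough:
                break  # remaining nodes are never evaluated
    return found


class EventCounter:
    """Hypothetical client-side tracker for repetitive events.

    A repeated (object, reason) pair only bumps a local counter
    instead of producing a brand new event each time.
    """

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, obj: str, reason: str) -> int:
        key: Tuple[str, str] = (obj, reason)
        self.counts[key] += 1
        return self.counts[key]


if __name__ == "__main__":
    cluster = [f"node-{i}" for i in range(5000)]
    evaluated: List[str] = []

    def no_conflict(node: str) -> bool:
        # Stand-in check: pretend even-numbered nodes satisfy the rules.
        evaluated.append(node)
        return int(node.split("-")[1]) % 2 == 0

    print(find_feasible_nodes(cluster, no_conflict, enough=3))
    print(f"evaluated {len(evaluated)} of {len(cluster)} nodes")

    events = EventCounter()
    for _ in range(4):
        events.record("pod/web-0", "FailedScheduling")
    print(events.counts)
```

With the made-up feasibility rule above, the search stops after looking at only the first five nodes out of 5,000, which is the kind of saving I understood the scheduler change to be after. The event counter shows the other idea: four identical events end up as a single entry with a count of four.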