Prometheus & Grafana Monitoring Stack on a TKGS workload cluster in vSphere with Tanzu
In this post, we build on the work already done when we deployed Carvel packages on a Tanzu Kubernetes workload cluster created by the TKG Service in vSphere with Tanzu. In that post we covered the requirements, how to use the tanzu command line to set the context to a workload cluster, and how to add the TKG v1.4 package repository. We also saw how to use the tanzu CLI to deploy our first package, Cert Manager. We will now continue with the deployment of a number of other packages: Contour (for Ingress), External-DNS (to connect to our existing DNS), Prometheus (for monitoring our cluster) and finally Grafana (for dashboards). The goal here is to stand up a monitoring stack for the Tanzu Kubernetes cluster. What we will see is that these packages have been built to work well together in Tanzu Kubernetes with minimal configuration, so you should be able to get this monitoring stack stood up quickly without too much effort. We will look at how to retrieve the various configuration values from each package and deploy the packages with some bespoke values. Let’s look at that in detail.
Cert Manager (already installed)
We already deployed Cert Manager in the previous post. Cert Manager makes the monitoring stack more secure; here we use it to secure communications between Contour and the Envoy ingress controller, which we will deploy next. Cert Manager automates certificate management, providing certificates-as-a-service capabilities. As we saw previously, there is no requirement to supply any bespoke values to Cert Manager. The only configuration option is the namespace in which cert-manager deploys its package resources (default: cert-manager).
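If you did want to change that single option, a minimal values file would simply override that key and be passed to tanzu package install with --values-file. This is just a sketch; the filename cert-manager-values.yaml is my own example name, not something the package requires:

% cat cert-manager-values.yaml
# Optional override of the namespace into which the cert-manager package resources are deployed
namespace: cert-manager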
% tanzu package installed list
\ Retrieving installed packages...
  NAME          PACKAGE-NAME                   PACKAGE-VERSION       STATUS
  cert-manager  cert-manager.tanzu.vmware.com  1.1.0+vmware.1-tkg.2  Reconcile succeeded
Contour
The reason for installing Contour is that Prometheus requires an Ingress. Contour provides this functionality via an Envoy ingress controller. Contour is an open-source Kubernetes Ingress controller that acts as a control plane for Envoy. Since this deployment is using an NSX ALB, we can provide a bespoke Contour data values file to set the Envoy service type to LoadBalancer, as well as set the number of Contour replicas. We can retrieve the configurable values using the following commands. First, we need to get the Contour package version.
% tanzu package available list contour.tanzu.vmware.com
/ Retrieving package versions for contour.tanzu.vmware.com...
  NAME                      VERSION                RELEASED-AT
  contour.tanzu.vmware.com  1.17.1+vmware.1-tkg.1  2021-07-23 19:00:00 +0100 IST
With the package version, we can retrieve the default Contour values. (Note that you use exactly the same methodology to retrieve the values file from every other Carvel package. So while I show the detailed procedure for the Contour package, I will not repeat it for the other packages we install in this post.)
% image_url=$(kubectl get packages contour.tanzu.vmware.com.1.17.1+vmware.1-tkg.1 \
  -o jsonpath='{.spec.template.spec.fetch[0].imgpkgBundle.image}')

% echo $image_url
projects.registry.vmware.com/tkg/packages/standard/contour@sha256:73dc13131e6c1cfa8d3b56aeacd97734447acdf1ab8c0862e936623ca744e7c4

% mkdir ./contour

% imgpkg pull -b $image_url -o ./contour
Pulling bundle 'projects.registry.vmware.com/tkg/packages/standard/contour@sha256:73dc13131e6c1cfa8d3b56aeacd97734447acdf1ab8c0862e936623ca744e7c4'
  Extracting layer 'sha256:93c1f3e88f0e0181e11a38a4e04ac16c21c5949622917b6c72682cc497ab3e44' (1/1)

Locating image lock file images...
One or more images not found in bundle repo; skipping lock file update

Succeeded
%
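Since exactly the same retrieval steps are repeated for the other packages later in this post, they can be wrapped in a small helper function. This is just a convenience sketch, assuming each package bundle keeps its default values in config/values.yaml as the Contour bundle does:

# Pull the imgpkg bundle for a package CR and print its default data values.
# Usage: fetch_values contour.tanzu.vmware.com.1.17.1+vmware.1-tkg.1 ./contour
fetch_values() {
  local pkg="$1" dir="$2"
  local url
  url=$(kubectl get packages "${pkg}" \
    -o jsonpath='{.spec.template.spec.fetch[0].imgpkgBundle.image}')
  mkdir -p "${dir}"
  imgpkg pull -b "${url}" -o "${dir}"
  cat "${dir}/config/values.yaml"
}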
The values file is now available in ./contour/config/values.yaml. These are the default values with which the package is deployed, but they can be modified to meet various requirements. Here are the default Contour values:
infrastructure_provider: vsphere
namespace: tanzu-system-ingress
contour:
  configFileContents: {}
  useProxyProtocol: false
  replicas: 2
  pspNames: "vmware-system-restricted"
  logLevel: info
envoy:
  service:
    type: null
    annotations: {}
    nodePorts:
      http: null
      https: null
    externalTrafficPolicy: Cluster
    aws:
      LBType: classic
    disableWait: false
  hostPorts:
    enable: true
    http: 80
    https: 443
  hostNetwork: false
  terminationGracePeriodSeconds: 300
  logLevel: info
  pspNames: null
certificates:
  duration: 8760h
  renewBefore: 360h
I usually make a copy of the values file before making any changes. Note that not every field needs to be added to your own bespoke values file. For the purposes of deploying this package on my workload cluster, I simply want to use a service type of LoadBalancer for Envoy, and to use Cert Manager which was previously installed. Thus my bespoke values file for Contour would look something like this.
% cat contour.yaml
envoy:
  service:
    type: LoadBalancer
certificates:
  useCertManager: true
I can now deploy the Contour package with my bespoke values file which I have called contour.yaml, as follows:
% tanzu package available list contour.tanzu.vmware.com
/ Retrieving package versions for contour.tanzu.vmware.com...
  NAME                      VERSION                RELEASED-AT
  contour.tanzu.vmware.com  1.17.1+vmware.1-tkg.1  2021-07-23 19:00:00 +0100 IST

% tanzu package install contour --package-name contour.tanzu.vmware.com --version 1.17.1+vmware.1-tkg.1 \
  --values-file contour.yaml
/ Installing package 'contour.tanzu.vmware.com'
| Getting namespace 'default'
/ Getting package metadata for 'contour.tanzu.vmware.com'
| Creating service account 'contour-default-sa'
| Creating cluster admin role 'contour-default-cluster-role'
| Creating cluster role binding 'contour-default-cluster-rolebinding'
| Creating secret 'contour-default-values'
\ Creating package resource
\ Package install status: Reconciling

Added installed package 'contour' in namespace 'default'
%
You can check that the package has installed successfully in a number of different ways.
% kubectl get apps
NAME           DESCRIPTION           SINCE-DEPLOY   AGE
cert-manager   Reconcile succeeded   45s            138m
contour        Reconcile succeeded   92s            96s

% tanzu package installed list
- Retrieving installed packages...
  NAME          PACKAGE-NAME                   PACKAGE-VERSION        STATUS
  cert-manager  cert-manager.tanzu.vmware.com  1.1.0+vmware.1-tkg.2   Reconcile succeeded
  contour       contour.tanzu.vmware.com       1.17.1+vmware.1-tkg.1  Reconcile succeeded
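You can also query an individual package install directly, or look at the underlying Carvel PackageInstall resources that kapp-controller reconciles. These are optional checks; the commands below are illustrative rather than output captured from my cluster:

# Detailed status for a single installed package
% tanzu package installed get contour

# The Carvel PackageInstall resources behind each tanzu package install
% kubectl get packageinstalls -A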
You should be able to examine the package resource objects which have been deployed in the namespace tanzu-system-ingress (as configured in the default values file), including the envoy daemonset as shown below.
% kubectl get daemonset/envoy -n tanzu-system-ingress
NAME    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
envoy   3         3         3       3            3           <none>          2m7s
We should also observe that Envoy has been allocated a Load Balancer service.
% kubectl get svc -n tanzu-system-ingress
NAME      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
contour   ClusterIP      100.69.120.194   <none>        8001/TCP                     5m9s
envoy     LoadBalancer   100.71.42.85     xx.xx.62.20   80:30368/TCP,443:32723/TCP   5m8s
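As a quick sanity check, the Envoy Load Balancer address should already be answering HTTP requests. Since no HTTPProxy or Ingress routes have been published yet, a 404 response from Envoy is expected at this point. An illustrative check, substituting your own EXTERNAL-IP:

# With no HTTPProxy/Ingress routes published yet, Envoy is expected to answer with a 404
% curl -s -o /dev/null -w "%{http_code}\n" http://xx.xx.62.20/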
We can now move on to deploying External-DNS.
External-DNS
This is an optional package. It allows Kubernetes services to be automatically added to your external DNS, which is useful when we look at the Grafana dashboards later, as an FQDN can be used to access them rather than an IP address. In this example, I am integrating insecure connections to Microsoft DNS, as per these instructions. There are two steps to be carried out on the Microsoft DNS configuration for the domain that you plan to integrate with: (1) allow both secure and non-secure dynamic updates, and (2) allow zone transfers to any server. I am using the RFC2136 provider, which allows me to use any RFC2136-compatible DNS server, such as Microsoft DNS, as a provider for External-DNS. I am integrating with my rainpole.com domain. Below is a copy of my External-DNS values file. Note the inclusion of the rfc2136-insecure argument (to support insecure dynamic updates) and the rfc2136-tsig-axfr argument (to support zone transfers); zone transfers are needed for the deletion of records. Notice also that contour-httpproxy has been added as a source, meaning that services exposed via Contour HTTPProxy resources, e.g. Prometheus and Grafana, will be added to the DNS. We will see this later.
% cat external-dns.yaml
namespace: tanzu-system-service-discovery
deployment:
  args:
    - --registry=txt
    - --txt-prefix=external-dns-
    - --txt-owner-id=tanzu
    - --provider=rfc2136
    - --rfc2136-host=xx.xx.51.252
    - --rfc2136-port=53
    - --rfc2136-zone=rainpole.com
    - --rfc2136-insecure
    - --rfc2136-tsig-axfr
    - --source=service
    - --source=contour-httpproxy
    - --source=ingress
    - --domain-filter=rainpole.com
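Before deploying the package, it can be worth confirming that the DNS server referenced by rfc2136-host is reachable and authoritative for the zone. A quick, optional check from a workstation (using the masked address from the values file above) might be:

# Query the SOA record for the zone directly against the target DNS server
% nslookup -type=SOA rainpole.com xx.xx.51.252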
Let’s now deploy this package with the above values.
% tanzu package available list external-dns.tanzu.vmware.com
- Retrieving package versions for external-dns.tanzu.vmware.com...
  NAME                           VERSION               RELEASED-AT
  external-dns.tanzu.vmware.com  0.8.0+vmware.1-tkg.1  2021-06-11 19:00:00 +0100 IST

% tanzu package install external-dns -p external-dns.tanzu.vmware.com -v 0.8.0+vmware.1-tkg.1 \
  --values-file external-dns.yaml
| Installing package 'external-dns.tanzu.vmware.com'
/ Getting namespace 'default'
- Getting package metadata for 'external-dns.tanzu.vmware.com'
| Creating service account 'external-dns-default-sa'
| Creating cluster admin role 'external-dns-default-cluster-role'
| Creating cluster role binding 'external-dns-default-cluster-rolebinding'
| Creating secret 'external-dns-default-values'
\ Creating package resource
/ Package install status: Reconciling

Added installed package 'external-dns' in namespace 'default'
%
External-DNS deploys package resources into the tanzu-system-service-discovery namespace. If we examine the logs of the external-dns pod, we should see a message stating that RFC2136 has been configured.
% kubectl -n tanzu-system-service-discovery logs external-dns-777f74bd6c-zs7bn
. . .
time="2022-02-15T15:08:09Z" level=info msg="Instantiating new Kubernetes client"
time="2022-02-15T15:08:09Z" level=info msg="Using inCluster-config based on serviceaccount-token"
time="2022-02-15T15:08:09Z" level=info msg="Created Kubernetes client https://100.64.0.1:443"
time="2022-02-15T15:08:10Z" level=info msg="Created Dynamic Kubernetes client https://100.64.0.1:443"
time="2022-02-15T15:08:12Z" level=info msg="Configured RFC2136 with zone 'rainpole.com.' and nameserver 'xx.xx.51.252:53'"
Everything looks good. We can now move on to Prometheus.
Prometheus
Prometheus records real-time metrics and provides alerting capabilities. It has a requirement for an Ingress (or HTTPProxy), and that requirement is met by Contour. Prometheus has quite a number of configuration options, most of which I am not displaying here. In my cluster, Prometheus is configured to use an Ingress, with an FQDN that is part of the external DNS domain – prometheus-tkgs-cork.rainpole.com (you obviously need to change this to something in your own DNS). It also has a Storage Class defined for its Persistent Volumes: the Prometheus server requires a 150GB volume, while the Alertmanager requires a 2GB volume.
% cat prometheus.yaml
ingress:
  enabled: true
  virtual_host_fqdn: "prometheus-tkgs-cork.rainpole.com"
  prometheus_prefix: "/"
  alertmanager_prefix: "/alertmanager/"
  prometheusServicePort: 80
  alertmanagerServicePort: 80
prometheus:
  pvc:
    storageClassName: vsan-default-storage-policy
alertmanager:
  pvc:
    storageClassName: vsan-default-storage-policy
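The 150GB and 2GB sizes mentioned above come from the package defaults. If you want different sizes, the defaults retrieved via the imgpkg method appear to expose a storage size field alongside storageClassName under each pvc section; treat the keys below as an assumption to verify against your own copy of the default values.yaml before using them:

# Assumption: the default values.yaml exposes a 'storage' size under prometheus.pvc and alertmanager.pvc
prometheus:
  pvc:
    storageClassName: vsan-default-storage-policy
    storage: "150Gi"
alertmanager:
  pvc:
    storageClassName: vsan-default-storage-policy
    storage: "2Gi"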
We now deploy the package with the above values.
% tanzu package available list prometheus.tanzu.vmware.com
- Retrieving package versions for prometheus.tanzu.vmware.com...
  NAME                         VERSION                RELEASED-AT
  prometheus.tanzu.vmware.com  2.27.0+vmware.1-tkg.1  2021-05-12 19:00:00 +0100 IST

% tanzu package install prometheus --package-name prometheus.tanzu.vmware.com \
  --version 2.27.0+vmware.1-tkg.1 --values-file prometheus.yaml
- Installing package 'prometheus.tanzu.vmware.com'
| Getting namespace 'default'
/ Getting package metadata for 'prometheus.tanzu.vmware.com'
| Creating service account 'prometheus-default-sa'
| Creating cluster admin role 'prometheus-default-cluster-role'
| Creating cluster role binding 'prometheus-default-cluster-rolebinding'
| Creating secret 'prometheus-default-values'
\ Creating package resource
| Package install status: Reconciling

Added installed package 'prometheus' in namespace 'default'
%
This creates a significant number of package resources for Prometheus in the namespace tanzu-system-monitoring.
% kubectl get deploy,rs,pods -n tanzu-system-monitoring
NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/alertmanager                    1/1     1            1           5m11s
deployment.apps/prometheus-kube-state-metrics   1/1     1            1           5m13s
deployment.apps/prometheus-pushgateway          1/1     1            1           5m12s
deployment.apps/prometheus-server               1/1     1            1           5m13s

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/alertmanager-669c4f497d                   1         1         1       5m11s
replicaset.apps/prometheus-kube-state-metrics-6ccbc7bfc   1         1         1       5m13s
replicaset.apps/prometheus-pushgateway-6d7bc967f9         1         1         1       5m12s
replicaset.apps/prometheus-server-7cc7df4dd6              1         1         1       5m13s

NAME                                                READY   STATUS    RESTARTS   AGE
pod/alertmanager-669c4f497d-wsx2s                   1/1     Running   0          5m11s
pod/prometheus-cadvisor-fxhw8                       1/1     Running   0          5m13s
pod/prometheus-cadvisor-m758x                       1/1     Running   0          5m13s
pod/prometheus-cadvisor-mzzm7                       1/1     Running   0          5m13s
pod/prometheus-kube-state-metrics-6ccbc7bfc-fsggg   1/1     Running   0          5m13s
pod/prometheus-node-exporter-24rc7                  1/1     Running   0          5m13s
pod/prometheus-node-exporter-9b9nh                  1/1     Running   0          5m13s
pod/prometheus-node-exporter-g6vtp                  1/1     Running   0          5m13s
pod/prometheus-node-exporter-p8tkk                  1/1     Running   0          5m13s
pod/prometheus-pushgateway-6d7bc967f9-stctt         1/1     Running   0          5m12s
pod/prometheus-server-7cc7df4dd6-ncbhz              2/2     Running   0          5m13s

% kubectl get httpproxy -n tanzu-system-monitoring
NAME                   FQDN                                TLS SECRET       STATUS   STATUS DESCRIPTION
prometheus-httpproxy   prometheus-tkgs-cork.rainpole.com   prometheus-tls   valid    Valid HTTPProxy

% kubectl get pvc,pv -n tanzu-system-monitoring
NAME                                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
persistentvolumeclaim/alertmanager        Bound    pvc-912dacdd-6533-45a4-948a-34b9fd383d37   2Gi        RWO            vsan-default-storage-policy   5m29s
persistentvolumeclaim/prometheus-server   Bound    pvc-08e145e4-a267-4e1e-89d2-513cff467512   150Gi      RWO            vsan-default-storage-policy   5m29s

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                       STORAGECLASS                  REASON   AGE
persistentvolume/pvc-08e145e4-a267-4e1e-89d2-513cff467512   150Gi      RWO            Delete           Bound    tanzu-system-monitoring/prometheus-server   vsan-default-storage-policy            3m20s
persistentvolume/pvc-912dacdd-6533-45a4-948a-34b9fd383d37   2Gi        RWO            Delete           Bound    tanzu-system-monitoring/alertmanager        vsan-default-storage-policy            5m27s

% kubectl get svc -n tanzu-system-monitoring
NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
alertmanager                    ClusterIP   100.71.143.203   <none>        80/TCP          6m15s
prometheus-kube-state-metrics   ClusterIP   None             <none>        80/TCP,81/TCP   6m17s
prometheus-node-exporter        ClusterIP   100.67.91.192    <none>        9100/TCP        6m17s
prometheus-pushgateway          ClusterIP   100.70.217.119   <none>        9091/TCP        6m17s
prometheus-server               ClusterIP   100.67.174.220   <none>        80/TCP          6m17s
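Even before the external FQDN resolves, you can verify that the Prometheus server itself is healthy from inside the cluster by port-forwarding to its ClusterIP service and querying Prometheus's standard health endpoint. This is an optional check, not part of the package install:

# Forward a local port to the prometheus-server service (service port 80) in the monitoring namespace
% kubectl -n tanzu-system-monitoring port-forward svc/prometheus-server 9090:80 &

# Prometheus exposes /-/healthy and /-/ready endpoints; both should return HTTP 200
% curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9090/-/healthy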
Assuming the httpproxy has been successfully created, and that external DNS is working, it should now be possible to connect to the FQDN of the Prometheus service via a browser. If you did not integrate with DNS, you can instead add the Prometheus FQDN and the Envoy Load Balancer IP address to your local /etc/hosts file. Either way, the Prometheus UI should then be displayed in the browser.
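For example, once External-DNS has picked up the new httpproxy, the record should resolve; otherwise a single /etc/hosts entry pointing the FQDN at the Envoy Load Balancer address (from the earlier output) does the job. Both are shown here purely as illustrations:

# Check that the record created by External-DNS resolves
% nslookup prometheus-tkgs-cork.rainpole.com

# Or, without DNS integration, add a local /etc/hosts entry mapping the FQDN to the Envoy Load Balancer IP
xx.xx.62.20   prometheus-tkgs-cork.rainpole.com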
With Prometheus successfully installed, we can move on to deploying the Grafana package for dashboards.
Grafana
As before, the Grafana configuration values can be retrieved using the method shown earlier. There are quite a number of values, but we will again keep the values file simple. Through the values file, we provide a data source pointing to Prometheus. The Prometheus URL is an internal Kubernetes URL, made up of the Service name and namespace of the Prometheus server. Since Grafana runs in the same cluster, Prometheus and Grafana can communicate using internal Kubernetes networking. This is my sample Grafana values file.
% cat grafana.yaml
namespace: tanzu-system-dashboard
ingress:
  virtual_host_fqdn: "graf-tkgs-cork.rainpole.com"
grafana:
  config:
    datasource_yaml: |-
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: prometheus-server.tanzu-system-monitoring.svc.cluster.local
          access: proxy
          isDefault: true
  pvc:
    storageClassName: vsan-default-storage-policy
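The datasource url follows the standard <service>.<namespace>.svc.cluster.local pattern, so it is worth double-checking that it matches the Service name and namespace that the Prometheus package actually created, as shown earlier:

# The Grafana datasource URL must match this Service name and namespace
% kubectl -n tanzu-system-monitoring get svc prometheus-server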
% tanzu package available list grafana.tanzu.vmware.com
\ Retrieving package versions for grafana.tanzu.vmware.com...
  NAME                      VERSION               RELEASED-AT
  grafana.tanzu.vmware.com  7.5.7+vmware.1-tkg.1  2021-05-19 19:00:00 +0100 IST

% tanzu package install grafana -p grafana.tanzu.vmware.com -v 7.5.7+vmware.1-tkg.1 \
  --values-file grafana.yaml
- Installing package 'grafana.tanzu.vmware.com'
| Getting namespace 'default'
/ Getting package metadata for 'grafana.tanzu.vmware.com'
| Creating service account 'grafana-default-sa'
- Creating cluster admin role 'grafana-default-cluster-role'
| Creating cluster role binding 'grafana-default-cluster-rolebinding'
| Creating secret 'grafana-default-values'
\ Creating package resource
| Package install status: Reconciling

Added installed package 'grafana' in namespace 'default'
%
Once again, a number of Grafana package resources are created in the tanzu-system-dashboard namespace. Note that Grafana gets a Load Balancer IP address and its own httpproxy. The Grafana FQDN should also be automatically added to your external DNS, if configured.
% kubectl -n tanzu-system-dashboard get pods
NAME                       READY   STATUS    RESTARTS   AGE
grafana-7fc98dd5b8-bq2mw   2/2     Running   0          2m25s

% kubectl -n tanzu-system-dashboard get svc
NAME      TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
grafana   LoadBalancer   100.67.139.55   xx.xx.62.21   80:31126/TCP   2m33s

% kubectl -n tanzu-system-dashboard get httpproxy
NAME                FQDN                          TLS SECRET    STATUS   STATUS DESCRIPTION
grafana-httpproxy   graf-tkgs-cork.rainpole.com   grafana-tls   valid    Valid HTTPProxy

% kubectl -n tanzu-system-dashboard get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
grafana-pvc   Bound    pvc-72e06aa7-9944-439a-80e0-40fecd61915b   2Gi        RWO            vsan-default-storage-policy   2m54s
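Before logging in, you can confirm Grafana is responding, either via the Load Balancer IP or the FQDN, using Grafana's standard health endpoint. This is an optional, illustrative check:

# Grafana's /api/health endpoint returns a small JSON payload when the instance is up
% curl -s http://graf-tkgs-cork.rainpole.com/api/health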
Assuming everything is working correctly, it should now be possible to connect to the Grafana FQDN to see the dashboards from the vSphere with Tanzu workload cluster. Again, if you haven’t integrated with DNS, just add the Grafana FQDN and the Grafana Load Balancer IP address shown above to your local /etc/hosts. The default login for Grafana is admin/admin, and you are then prompted to set a new admin password. Then navigate to Dashboards > Manage on the left-hand side and select one of the two existing dashboards, such as the TKG Kubernetes cluster monitoring dashboard, to see some metrics. Of course, you can then proceed to build your own bespoke dashboards if you wish.
Hopefully this has provided some useful insight into how easy it is to stand up a Prometheus and Grafana monitoring stack on a workload cluster provisioned via the TKG Service in vSphere with Tanzu. Here is the full list of packages that have now been installed on the cluster.
% tanzu package installed list
- Retrieving installed packages...
  NAME          PACKAGE-NAME                   PACKAGE-VERSION        STATUS
  cert-manager  cert-manager.tanzu.vmware.com  1.1.0+vmware.1-tkg.2   Reconcile succeeded
  contour       contour.tanzu.vmware.com       1.17.1+vmware.1-tkg.1  Reconcile succeeded
  external-dns  external-dns.tanzu.vmware.com  0.8.0+vmware.1-tkg.1   Reconcile succeeded
  grafana       grafana.tanzu.vmware.com       7.5.7+vmware.1-tkg.1   Reconcile succeeded
  prometheus    prometheus.tanzu.vmware.com    2.27.0+vmware.1-tkg.1  Reconcile succeeded

% kubectl get apps
NAME           DESCRIPTION           SINCE-DEPLOY   AGE
cert-manager   Reconcile succeeded   58s            172m
contour        Reconcile succeeded   29s            35m
external-dns   Reconcile succeeded   39s            25m
grafana        Reconcile succeeded   2m20s          2m24s
prometheus     Reconcile succeeded   22s            17m
Finally, please note that Tanzu Services is now available in VMware Cloud. VMware Cloud allows vSphere administrators to deploy Tanzu Kubernetes workload clusters via the TKG Service for their DevOps teams without having to manage the underlying SDDC infrastructure. Read more about Tanzu Services here.