Prometheus & Grafana Monitoring Stack on TKGS workload cluster in vSphere with Tanzu

In this post, we build on the work already done when we deployed Carvel packages on a Tanzu Kubernetes workload cluster created by the TKG Service in vSphere with Tanzu. In that post we covered the requirements, how to use the tanzu command line to set the context to a workload cluster, and how to add the TKG v1.4 package repository. We also saw how to use the tanzu CLI to deploy our first package, cert-manager. We will now continue with the deployment of a number of other packages: Contour (for Ingress), External-DNS (to connect to our existing DNS), Prometheus (for monitoring the cluster) and finally Grafana (for dashboards). The goal is to stand up a monitoring stack for the Tanzu Kubernetes cluster. What we will see is that these packages have been built to work well together in Tanzu Kubernetes with minimal configuration, so you should be able to get this monitoring stack stood up quickly without too much effort. We will look at how to retrieve the configurable values for each package and deploy the packages with some bespoke values. Let’s look at that in detail.

Cert Manager (already installed)

We already deployed Cert Manager in the previous post. Cert Manager makes the monitoring app stack more secure. We use it to secure communications between Contour and the Envoy Ingress which we will deploy next. Cert Manager automates certificate management, providing certificates-as-a-service capabilities. As we saw previously, there is no requirement to supply any bespoke values to Cert Manager. The only configuration option is the namespace in which cert-manager deploys its package resources (default: cert-manager).

% tanzu package installed list
\ Retrieving installed packages...
  NAME          PACKAGE-NAME                   PACKAGE-VERSION       STATUS
  cert-manager  cert-manager.tanzu.vmware.com  1.1.0+vmware.1-tkg.2  Reconcile succeeded
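
As a quick sanity check beyond the package status, you can also confirm that the cert-manager pods are running in the (default) cert-manager namespace; you should see the cert-manager, cainjector and webhook pods in a Running state.

% kubectl get pods -n cert-manager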

Contour

The reason for installing Contour is that Prometheus has a requirement on an Ingress. Contour provides this functionality via an Envoy ingress controller. Contour is an open source Kubernetes Ingress controller that acts as a control plane for Envoy.​ Since this deployment is using an NSX ALB, we can provide a bespoke Contour data values file to set the Envoy service type to LoadBalancer, as well as set the number of Contour replicas. We can retrieve the configurable values using the following commands. First, we need to get the Contour package version.

% tanzu package available list contour.tanzu.vmware.com
/ Retrieving package versions for contour.tanzu.vmware.com...
  NAME                      VERSION                RELEASED-AT
  contour.tanzu.vmware.com  1.17.1+vmware.1-tkg.1  2021-07-23 19:00:00 +0100 IST

With the package version, we can retrieve the default Contour values. (Note that you use exactly the same methodology to retrieve the values file from every other Carvel package. So while I show the detailed procedure for the Contour package, I will not repeat it for the other packages we install in this post.)

% image_url=$(kubectl get packages contour.tanzu.vmware.com.1.17.1+vmware.1-tkg.1 \
-o jsonpath='{.spec.template.spec.fetch[0].imgpkgBundle.image}')


% echo $image_url
projects.registry.vmware.com/tkg/packages/standard/contour@sha256:73dc13131e6c1cfa8d3b56aeacd97734447acdf1ab8c0862e936623ca744e7c4


% mkdir ./contour


% imgpkg pull -b $image_url -o ./contour
Pulling bundle 'projects.registry.vmware.com/tkg/packages/standard/contour@sha256:73dc13131e6c1cfa8d3b56aeacd97734447acdf1ab8c0862e936623ca744e7c4'
  Extracting layer 'sha256:93c1f3e88f0e0181e11a38a4e04ac16c21c5949622917b6c72682cc497ab3e44' (1/1)

Locating image lock file images...
One or more images not found in bundle repo; skipping lock file update

Succeeded

%

The values file is now available in ./contour/config/values.yaml. These are the default values with which the package is deployed; they can be modified to meet various requirements. Here are the default Contour values:

infrastructure_provider: vsphere
namespace: tanzu-system-ingress
contour:
  configFileContents: {}
  useProxyProtocol: false
  replicas: 2
  pspNames: "vmware-system-restricted"
  logLevel: info
envoy:
  service:
    type: null
    annotations: {}
    nodePorts:
      http: null
      https: null
    externalTrafficPolicy: Cluster
    aws:
      LBType: classic
    disableWait: false
  hostPorts:
    enable: true
    http: 80
    https: 443
  hostNetwork: false
  terminationGracePeriodSeconds: 300
  logLevel: info
  pspNames: null
certificates:
  duration: 8760h
  renewBefore: 360h

I usually make a copy of the values file before making any changes. Note that not every field needs to be added to your own bespoke values file. For the purposes of deploying this package on my workload cluster, I simply want to use a service type of LoadBalancer for Envoy, and to use Cert Manager which was previously installed. Thus my bespoke values file for Contour would look something like this.

% cat contour.yaml
envoy:
  service:
    type: LoadBalancer
certificates:
  useCertManager: true

I can now deploy the Contour package with my bespoke values file which I have called contour.yaml, as follows:

% tanzu package available list contour.tanzu.vmware.com
/ Retrieving package versions for contour.tanzu.vmware.com...
  NAME                      VERSION                RELEASED-AT
  contour.tanzu.vmware.com  1.17.1+vmware.1-tkg.1  2021-07-23 19:00:00 +0100 IST


% tanzu package install contour --package-name contour.tanzu.vmware.com --version 1.17.1+vmware.1-tkg.1 \
--values-file contour.yaml
/ Installing package 'contour.tanzu.vmware.com'
| Getting namespace 'default'
/ Getting package metadata for 'contour.tanzu.vmware.com'
| Creating service account 'contour-default-sa'
| Creating cluster admin role 'contour-default-cluster-role'
| Creating cluster role binding 'contour-default-cluster-rolebinding'
| Creating secret 'contour-default-values'
\ Creating package resource
\ Package install status: Reconciling

 Added installed package 'contour' in namespace 'default'

%

You can check that the package has installed successfully in a number of different ways.

% kubectl get apps
NAME          DESCRIPTION          SINCE-DEPLOY  AGE
cert-manager  Reconcile succeeded  45s            138m
contour       Reconcile succeeded  92s            96s


% tanzu package installed list
- Retrieving installed packages...
  NAME          PACKAGE-NAME                   PACKAGE-VERSION        STATUS
  cert-manager  cert-manager.tanzu.vmware.com  1.1.0+vmware.1-tkg.2   Reconcile succeeded
  contour       contour.tanzu.vmware.com       1.17.1+vmware.1-tkg.1  Reconcile succeeded

You should be able to examine the package resource objects which have been deployed in the namespace tanzu-system-ingress (as configured in the default values file), including the envoy daemonset as shown below.

% kubectl get daemonset/envoy -n tanzu-system-ingress
NAME    DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR  AGE
envoy   3        3        3      3            3          <none>          2m7s

We should also observe that Envoy has been allocated a Load Balancer service.

% kubectl get svc -n tanzu-system-ingress
NAME      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
contour   ClusterIP      100.69.120.194   <none>        8001/TCP                     5m9s
envoy     LoadBalancer   100.71.42.85     xx.xx.62.20   80:30368/TCP,443:32723/TCP   5m8s
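
At this point no routes have been published, but as an optional sanity check you can send a request to the Envoy Load Balancer external IP shown above. Until an Ingress or HTTPProxy exists, you would typically expect a 404 response served by Envoy.

% curl -i http://xx.xx.62.20/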

We can now move on to deploying External-DNS.

External-DNS

This is an optional package. It allows Kubernetes services to be automatically added to your external DNS. This is useful when we access Prometheus and Grafana later, as their FQDNs can be used rather than IP addresses. In this example, I am integrating insecure connections to Microsoft DNS, as per these instructions. There are two steps to be carried out on the Microsoft DNS configuration for the domain that you plan to integrate with: (1) allow both secure and non-secure dynamic updates, and (2) allow zone transfers to any server. I am using the RFC2136 provider, which allows me to use any RFC2136-compatible DNS server, such as Microsoft DNS, with External-DNS. I am integrating with my rainpole.com domain. Below is a copy of my External-DNS values file. Note the inclusion of the rfc2136-insecure argument (to support insecure dynamic updates) and the rfc2136-tsig-axfr argument (to support zone transfers, which are needed for the deletion of records). Notice also that contour-httpproxy has been added as a source, meaning that HTTPProxy resources created via Contour (e.g. for Prometheus and Grafana) will have their FQDNs added to DNS. We will see this later.

$ cat external-dns.yaml
namespace: tanzu-system-service-discovery
deployment:
  args:
    - --registry=txt
    - --txt-prefix=external-dns-
    - --txt-owner-id=tanzu
    - --provider=rfc2136
    - --rfc2136-host=xx.xx.51.252
    - --rfc2136-port=53
    - --rfc2136-zone=rainpole.com
    - --rfc2136-insecure
    - --rfc2136-tsig-axfr
    - --source=service
    - --source=contour-httpproxy
    - --source=ingress
    - --domain-filter=rainpole.com
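
As an aside, if you prefer the command line to the DNS Manager UI, both Microsoft DNS changes mentioned above can be made with dnscmd on the Windows DNS server. Something along these lines should work (the zone name is mine; the first command enables both secure and non-secure dynamic updates, the second allows zone transfers to any server):

C:\> dnscmd . /Config rainpole.com /AllowUpdate 1
C:\> dnscmd . /ZoneResetSecondaries rainpole.com /NonSecure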

Let’s now deploy this package with the above values.

% tanzu package available list external-dns.tanzu.vmware.com
- Retrieving package versions for external-dns.tanzu.vmware.com...
  NAME                           VERSION               RELEASED-AT
  external-dns.tanzu.vmware.com  0.8.0+vmware.1-tkg.1  2021-06-11 19:00:00 +0100 IST


% tanzu package install external-dns -p external-dns.tanzu.vmware.com -v 0.8.0+vmware.1-tkg.1 \
--values-file external-dns.yaml
| Installing package 'external-dns.tanzu.vmware.com'
/ Getting namespace 'default'
- Getting package metadata for 'external-dns.tanzu.vmware.com'
| Creating service account 'external-dns-default-sa'
| Creating cluster admin role 'external-dns-default-cluster-role'
| Creating cluster role binding 'external-dns-default-cluster-rolebinding'
| Creating secret 'external-dns-default-values'
\ Creating package resource
/ Package install status: Reconciling

 Added installed package 'external-dns' in namespace 'default'

%

External-DNS deploys its package resources into the tanzu-system-service-discovery namespace. If we examine the logs of the external-dns pod, we should see a message stating that the RFC2136 provider has been configured.

% kubectl -n tanzu-system-service-discovery logs  external-dns-777f74bd6c-zs7bn
.
.
.
time="2022-02-15T15:08:09Z" level=info msg="Instantiating new Kubernetes client"
time="2022-02-15T15:08:09Z" level=info msg="Using inCluster-config based on serviceaccount-token"
time="2022-02-15T15:08:09Z" level=info msg="Created Kubernetes client https://100.64.0.1:443"
time="2022-02-15T15:08:10Z" level=info msg="Created Dynamic Kubernetes client https://100.64.0.1:443"
time="2022-02-15T15:08:12Z" level=info msg="Configured RFC2136 with zone 'rainpole.com.' and nameserver 'xx.xx.51.252:53'"

Everything looks good. We can now move on to Prometheus.

Prometheus

Prometheus records real-time metrics and provides alerting capabilities. It has a requirement for an Ingress (or HTTPProxy), and that requirement is met by Contour. Prometheus has quite a number of configuration options, most of which I am not displaying here. In my cluster, Prometheus is configured to use an Ingress, with an FQDN that is part of the external DNS domain, prometheus-tkgs-cork.rainpole.com (you obviously need to change this to something in your own DNS). It also has a StorageClass defined for its Persistent Volumes. The Prometheus server requires a 150GB volume, while the Alertmanager requires a 2GB volume.

% cat prometheus.yaml
ingress:
  enabled: true
  virtual_host_fqdn: "prometheus-tkgs-cork.rainpole.com"
  prometheus_prefix: "/"
  alertmanager_prefix: "/alertmanager/"
  prometheusServicePort: 80
  alertmanagerServicePort: 80
prometheus:
  pvc:
    storageClassName: vsan-default-storage-policy
alertmanager:
  pvc:
    storageClassName: vsan-default-storage-policy
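
If you are not sure which StorageClass names are available on your workload cluster, you can list them first. In my environment, the class is derived from the vSAN default storage policy assigned to the vSphere Namespace.

% kubectl get storageclass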

We now deploy the package with the above values.

% tanzu package available list prometheus.tanzu.vmware.com
- Retrieving package versions for prometheus.tanzu.vmware.com...
  NAME                         VERSION                RELEASED-AT
  prometheus.tanzu.vmware.com  2.27.0+vmware.1-tkg.1  2021-05-12 19:00:00 +0100 IST


% tanzu package install prometheus --package-name prometheus.tanzu.vmware.com \
--version 2.27.0+vmware.1-tkg.1 --values-file prometheus.yaml
- Installing package 'prometheus.tanzu.vmware.com'
| Getting namespace 'default'
/ Getting package metadata for 'prometheus.tanzu.vmware.com'
| Creating service account 'prometheus-default-sa'
| Creating cluster admin role 'prometheus-default-cluster-role'
| Creating cluster role binding 'prometheus-default-cluster-rolebinding'
| Creating secret 'prometheus-default-values'
\ Creating package resource
| Package install status: Reconciling

 Added installed package 'prometheus' in namespace 'default'

%

This creates a significant number of package resources for Prometheus in the tanzu-system-monitoring namespace.

% kubectl get deploy,rs,pods -n tanzu-system-monitoring
NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/alertmanager                    1/1     1            1           5m11s
deployment.apps/prometheus-kube-state-metrics   1/1     1            1           5m13s
deployment.apps/prometheus-pushgateway          1/1     1            1           5m12s
deployment.apps/prometheus-server               1/1     1            1           5m13s

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/alertmanager-669c4f497d                   1         1         1       5m11s
replicaset.apps/prometheus-kube-state-metrics-6ccbc7bfc   1         1         1       5m13s
replicaset.apps/prometheus-pushgateway-6d7bc967f9         1         1         1       5m12s
replicaset.apps/prometheus-server-7cc7df4dd6              1         1         1       5m13s

NAME                                                READY   STATUS    RESTARTS   AGE
pod/alertmanager-669c4f497d-wsx2s                   1/1     Running   0          5m11s
pod/prometheus-cadvisor-fxhw8                       1/1     Running   0          5m13s
pod/prometheus-cadvisor-m758x                       1/1     Running   0          5m13s
pod/prometheus-cadvisor-mzzm7                       1/1     Running   0          5m13s
pod/prometheus-kube-state-metrics-6ccbc7bfc-fsggg   1/1     Running   0          5m13s
pod/prometheus-node-exporter-24rc7                  1/1     Running   0          5m13s
pod/prometheus-node-exporter-9b9nh                  1/1     Running   0          5m13s
pod/prometheus-node-exporter-g6vtp                  1/1     Running   0          5m13s
pod/prometheus-node-exporter-p8tkk                  1/1     Running   0          5m13s
pod/prometheus-pushgateway-6d7bc967f9-stctt         1/1     Running   0          5m12s
pod/prometheus-server-7cc7df4dd6-ncbhz              2/2     Running   0          5m13s


% kubectl get httpproxy -n tanzu-system-monitoring
NAME                   FQDN                                TLS SECRET      STATUS  STATUS DESCRIPTION
prometheus-httpproxy   prometheus-tkgs-cork.rainpole.com   prometheus-tls  valid   Valid HTTPProxy


% kubectl get pvc,pv -n tanzu-system-monitoring
NAME                                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
persistentvolumeclaim/alertmanager        Bound    pvc-912dacdd-6533-45a4-948a-34b9fd383d37   2Gi        RWO            vsan-default-storage-policy   5m29s
persistentvolumeclaim/prometheus-server   Bound    pvc-08e145e4-a267-4e1e-89d2-513cff467512   150Gi      RWO            vsan-default-storage-policy   5m29s

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                       STORAGECLASS                  REASON   AGE
persistentvolume/pvc-08e145e4-a267-4e1e-89d2-513cff467512   150Gi      RWO            Delete           Bound    tanzu-system-monitoring/prometheus-server   vsan-default-storage-policy            3m20s
persistentvolume/pvc-912dacdd-6533-45a4-948a-34b9fd383d37   2Gi        RWO            Delete           Bound    tanzu-system-monitoring/alertmanager        vsan-default-storage-policy            5m27s


% kubectl get svc -n tanzu-system-monitoring
NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
alertmanager                    ClusterIP   100.71.143.203   <none>        80/TCP          6m15s
prometheus-kube-state-metrics   ClusterIP   None             <none>        80/TCP,81/TCP   6m17s
prometheus-node-exporter        ClusterIP   100.67.91.192    <none>        9100/TCP        6m17s
prometheus-pushgateway          ClusterIP   100.70.217.119   <none>        9091/TCP        6m17s
prometheus-server               ClusterIP   100.67.174.220   <none>        80/TCP          6m17s

Assuming the httpproxy has been successfully created, and that External-DNS is working, it should now be possible to connect to the FQDN of the Prometheus service via a browser. If you did not integrate with DNS, you can instead add the Prometheus FQDN and the Envoy Load Balancer IP address to your local /etc/hosts file; this should also work. Either way, you should then be presented with the Prometheus UI.
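
For example, a quick way to verify name resolution before opening the browser is shown below. The nslookup should return the Envoy Load Balancer external IP seen earlier; if you skipped the External-DNS integration, the second command adds a static entry instead (adjust the FQDN and IP to your own environment).

% nslookup prometheus-tkgs-cork.rainpole.com
% echo "xx.xx.62.20 prometheus-tkgs-cork.rainpole.com" | sudo tee -a /etc/hosts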

With Prometheus successfully installed, we can move on to deploying the Grafana package for dashboards.

Grafana

As before, the Grafana configuration values can be retrieved using the same method shown earlier. There are quite a number of values but, as before, we will try to keep the values file quite simple. Through the values file, we provide a data source pointing to Prometheus. The Prometheus URL is an internal Kubernetes URL, made up of the Prometheus server Service name and its namespace (prometheus-server.tanzu-system-monitoring.svc.cluster.local). Since Grafana runs in the same cluster, Prometheus and Grafana can communicate using internal Kubernetes networking. This is my sample Grafana values file.

% cat grafana.yaml
namespace: tanzu-system-dashboard
ingress:
  virtual_host_fqdn: "graf-tkgs-cork.rainpole.com"
grafana:
  config:
    datasource_yaml: |-
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: prometheus-server.tanzu-system-monitoring.svc.cluster.local
          access: proxy
          isDefault: true
  pvc:
    storageClassName: vsan-default-storage-policy
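
Before installing, you can optionally verify that the datasource URL used above actually resolves and responds from inside the cluster. This is just a sketch; the curl-test pod name and the curlimages/curl image are arbitrary choices, and on a TKGS cluster you may need an appropriate pod security policy binding in the namespace where you run ad-hoc pods. A healthy Prometheus server should respond on its /-/healthy endpoint.

% kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://prometheus-server.tanzu-system-monitoring.svc.cluster.local/-/healthy

With the values file in place, we can check the available Grafana package version and install the package.
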
% tanzu package available list grafana.tanzu.vmware.com
\ Retrieving package versions for grafana.tanzu.vmware.com...
  NAME                      VERSION               RELEASED-AT
  grafana.tanzu.vmware.com  7.5.7+vmware.1-tkg.1  2021-05-19 19:00:00 +0100 IST


% tanzu package install grafana -p grafana.tanzu.vmware.com -v 7.5.7+vmware.1-tkg.1 \
--values-file grafana.yaml
- Installing package 'grafana.tanzu.vmware.com'
| Getting namespace 'default'
/ Getting package metadata for 'grafana.tanzu.vmware.com'
| Creating service account 'grafana-default-sa'
- Creating cluster admin role 'grafana-default-cluster-role'
| Creating cluster role binding 'grafana-default-cluster-rolebinding'
| Creating secret 'grafana-default-values'
\ Creating package resource
| Package install status: Reconciling

 Added installed package 'grafana' in namespace 'default'

%

Once again, a number of different Grafana package resources are created, this time in the tanzu-system-dashboard namespace. Note that Grafana gets a Load Balancer IP address and its own httpproxy. The Grafana FQDN should also be automatically added to your external DNS, if configured.

% kubectl -n tanzu-system-dashboard get pods
NAME                       READY   STATUS    RESTARTS   AGE
grafana-7fc98dd5b8-bq2mw   2/2     Running   0          2m25s


% kubectl -n tanzu-system-dashboard get svc
NAME      TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
grafana   LoadBalancer   100.67.139.55   xx.xx.62.21   80:31126/TCP   2m33s


% kubectl -n tanzu-system-dashboard get httpproxy
NAME                FQDN                          TLS SECRET    STATUS   STATUS DESCRIPTION
grafana-httpproxy   graf-tkgs-cork.rainpole.com   grafana-tls   valid    Valid HTTPProxy


% kubectl -n tanzu-system-dashboard get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
grafana-pvc   Bound    pvc-72e06aa7-9944-439a-80e0-40fecd61915b   2Gi        RWO            vsan-default-storage-policy   2m54s

Assuming everything is working correctly, it should now be possible to connect to the Grafana FQDN to see the dashboards from the vSphere with Tanzu workload cluster. Again, if you haven’t integrated with DNS, just add the Grafana FQDN and the Grafana Load Balancer IP address shown above to the local /etc/hosts. The default login for Grafana is admin/admin, and you are then prompted to set a new admin password. Then navigate to Dashboards > Manage on the left-hand side and select one of the two existing dashboards, such as the TKG Kubernetes cluster monitoring dashboard, to see some metrics. Of course, you can then proceed to build your own bespoke dashboards if you wish.

Hopefully this has provided some useful insight into how easy it is to stand up a Prometheus and Grafana monitoring stack on a workload cluster provisioned via the TKG Service in vSphere with Tanzu. Here is the full list of packages that have now been installed on the cluster.

% tanzu package installed list
- Retrieving installed packages...
  NAME          PACKAGE-NAME                   PACKAGE-VERSION        STATUS
  cert-manager  cert-manager.tanzu.vmware.com  1.1.0+vmware.1-tkg.2   Reconcile succeeded
  contour       contour.tanzu.vmware.com       1.17.1+vmware.1-tkg.1  Reconcile succeeded
  external-dns  external-dns.tanzu.vmware.com  0.8.0+vmware.1-tkg.1   Reconcile succeeded
  grafana       grafana.tanzu.vmware.com       7.5.7+vmware.1-tkg.1   Reconcile succeeded
  prometheus    prometheus.tanzu.vmware.com    2.27.0+vmware.1-tkg.1  Reconcile succeeded


% kubectl get apps
NAME           DESCRIPTION           SINCE-DEPLOY   AGE
cert-manager   Reconcile succeeded   58s            172m
contour        Reconcile succeeded   29s            35m
external-dns   Reconcile succeeded   39s            25m
grafana        Reconcile succeeded   2m20s          2m24s
prometheus     Reconcile succeeded   22s            17m

Finally, please note that Tanzu Services is now available in VMware Cloud. VMware Cloud allows vSphere administrators to deploy Tanzu Kubernetes workload clusters via the TKG Service for their DevOps teams without having to manage the underlying SDDC infrastructure. Read more about Tanzu Services here.