Deploying a monitoring stack (Prometheus and Grafana) on TKG v1.4 with External-DNS

Many customers who have deployed Tanzu Kubernetes would like to monitor activity on the cluster. In TKG v1.4, VMware provides all of the packages one would require to set up a full monitoring stack using Prometheus and Grafana. Prometheus records real-time metrics, and Grafana provides charts, graphs, and alerts when connected to a supported data source, such as Prometheus. Prometheus has a dependency on an Ingress, which we will provide through the Contour controller package (which includes an Envoy Ingress). In fact, Prometheus leverages a special kind of Ingress called an HTTPProxy, which is provided with Contour. We are also going to install the Cert-Manager package, although it is optional. Cert-Manager can provide secure communication between Contour and Envoy. Another optional package is External-DNS, which we will deploy to integrate the Prometheus and Grafana FQDNs into our Microsoft DNS. There is quite a lot here, but don’t worry, the deployment is quite straightforward. In this post, we will see how Prometheus and Grafana can be used to monitor a TKG cluster, and also how to integrate these apps with an external DNS provider.

In this setup, there is an existing deployment of both a TKG management cluster and a workload cluster. The deployment is to vSphere 7.0U2c using NSX ALB version 2.0.1.5 to provide Load Balancer services to the cluster. The clusters are integrated with LDAP (MS Active Directory) through the use of Pinniped and Dex. Deployment of TKG, NSX-ALB and the LDAP integration is not shown in this post, but you can find details on how to do this in other posts on this site. This environment also has an external DNS server (Microsoft DNS), which provides lookup services for both the vSphere infrastructure and workloads. This will be leveraged to add DNS records for the Prometheus and Grafana applications, both of which have an FQDN requirement. We will be deploying onto the workload cluster.

Whilst External-DNS can be integrated with many DNS providers, there are a few caveats when integrating with Microsoft DNS. First, the zone needs to allow both secure and non-secure dynamic updates, and second, the zone needs to allow zone transfers “to any server”. Zone transfers are needed for the deletion of records. Here are screenshots from the properties of my Microsoft DNS rainpole.com domain:
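
Both DNS settings can be sanity-checked from a Linux client before any packages are installed. The sketch below is one way to do that, assuming the BIND utilities nsupdate and dig are available on the client. The helper name make_nsupdate_batch, the host label extdns-test and the 192.0.2.x addresses are placeholders made up for illustration, not values from this environment.

```shell
# Print an nsupdate batch that adds a throwaway A record using an
# unsigned (non-secure) dynamic update.
# $1 = DNS server, $2 = zone, $3 = host label, $4 = IPv4 address
make_nsupdate_batch() {
  printf 'server %s\n' "$1"
  printf 'zone %s\n' "$2"
  printf 'update add %s.%s 60 A %s\n' "$3" "$2" "$4"
  printf 'send\n'
}

# Hypothetical usage (replace the placeholder addresses with your own):
#   make_nsupdate_batch 192.0.2.53 rainpole.com extdns-test 192.0.2.10 | nsupdate
#   dig @192.0.2.53 rainpole.com AXFR    # should list the zone if transfers are allowed
```

If the update is refused or the AXFR fails, it is worth revisiting the zone properties before continuing. Remember to delete the test record afterwards.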

2. Deploy Cert Manager

Cert Manager is an optional package, but we shall install it anyway to make the monitoring app stack more secure. We will use it to secure communications between Contour and the Envoy Ingress. Cert-manager automates certificate management. There is no requirement to supply any bespoke data values for the Cert Manager. The only configuration option is the namespace in which to deploy cert-manager (default: tanzu-certificates). Use the following commands to check which versions of the Cert Manager package are available for installation, and install the package.

$ tanzu package available list cert-manager.tanzu.vmware.com
- Retrieving package versions for cert-manager.tanzu.vmware.com...
  NAME                           VERSION               RELEASED-AT
  cert-manager.tanzu.vmware.com  1.1.0+vmware.1-tkg.2  2020-11-24T18:00:00Z


$ tanzu package install cert-manager -p cert-manager.tanzu.vmware.com --version 1.1.0+vmware.1-tkg.2
/ Installing package 'cert-manager.tanzu.vmware.com'
| Getting namespace 'default'
| Getting package metadata for 'cert-manager.tanzu.vmware.com'
| Creating service account 'cert-manager-default-sa'
| Creating cluster admin role 'cert-manager-default-cluster-role'
| Creating cluster role binding 'cert-manager-default-cluster-rolebinding'
- Creating package resource
| Package install status: Reconciling

 Added installed package 'cert-manager' in namespace 'default'


$ tanzu package installed list
/ Retrieving installed packages...
  NAME          PACKAGE-NAME                   PACKAGE-VERSION       STATUS
  cert-manager  cert-manager.tanzu.vmware.com  1.1.0+vmware.1-tkg.2  Reconcile succeeded


$ kubectl get apps
NAME            DESCRIPTION            SINCE-DEPLOY AGE
cert-manager    Reconcile succeeded    54s          60s
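
When the installs are scripted rather than typed by hand, it helps to confirm that an app has finished reconciling before moving on to the next package. The following is a minimal sketch, assuming the kubectl get apps output keeps the column layout shown above. The helper name app_reconciled is made up for this example.

```shell
# Read `kubectl get apps` output on stdin and succeed only if the app
# named in $1 reports "Reconcile succeeded" in its DESCRIPTION column.
app_reconciled() {
  awk -v app="$1" '
    $1 == app && $2 == "Reconcile" && $3 == "succeeded" { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Hypothetical usage:
#   kubectl get apps | app_reconciled cert-manager && echo "cert-manager is ready"
```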

3. Install Contour

The next step is to install the Contour Ingress controller, which uses Envoy to provide a special Ingress called HTTPProxy. We need to make some changes to the default deployment to tell it that the Envoy service should be of type LoadBalancer, and also to use Cert Manager for TLS certificates. These directives can be seen in the contour-simple.yaml manifest shown below, and they are passed to the deployment through the --values-file option of the install command. Note that Contour needs to be installed before External-DNS to enable Contour HTTPProxy support. If Contour is not installed before External-DNS, it won’t be possible to use HTTPProxy as a source in the External-DNS configuration later on.

$ cat contour-simple.yaml
envoy:
  service:
    type: LoadBalancer
certificates:
  useCertManager: true


$ tanzu package available list contour.tanzu.vmware.com
- Retrieving package versions for contour.tanzu.vmware.com...
  NAME                      VERSION                RELEASED-AT
  contour.tanzu.vmware.com  1.17.1+vmware.1-tkg.1  2021-07-23T18:00:00Z


$ tanzu package install contour -p contour.tanzu.vmware.com --version 1.17.1+vmware.1-tkg.1 --values-file contour-simple.yaml
/ Installing package 'contour.tanzu.vmware.com'
| Getting namespace 'default'
| Getting package metadata for 'contour.tanzu.vmware.com'
| Creating service account 'contour-default-sa'
| Creating cluster admin role 'contour-default-cluster-role'
| Creating cluster role binding 'contour-default-cluster-rolebinding'
| Creating secret 'contour-default-values'
- Creating package resource
\ Package install status: Reconciling

 Added installed package 'contour' in namespace 'default'


$ tanzu package installed list
- Retrieving installed packages...
  NAME          PACKAGE-NAME                   PACKAGE-VERSION        STATUS
  cert-manager  cert-manager.tanzu.vmware.com  1.1.0+vmware.1-tkg.2   Reconcile succeeded
  contour       contour.tanzu.vmware.com       1.17.1+vmware.1-tkg.1  Reconcile succeeded


$ kubectl get apps
NAME           DESCRIPTION          SINCE-DEPLOY   AGE
cert-manager   Reconcile succeeded  58s            5m59s
contour        Reconcile succeeded  76s            81s
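
Since the values file asks for a LoadBalancer service, it is worth checking that Envoy has actually been handed an external IP address by the load balancer. The sketch below assumes the Envoy service is named envoy and lives in the tanzu-system-ingress namespace. Neither name appears in this post, so treat both as assumptions and adjust them to whatever kubectl get svc -A reports in your cluster.

```shell
# Print the external IP assigned to the Envoy LoadBalancer service.
# $1 = namespace (assumed default: tanzu-system-ingress)
# The service name "envoy" is also an assumption.
envoy_lb_ip() {
  kubectl get service envoy \
    -n "${1:-tanzu-system-ingress}" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Hypothetical usage:
#   envoy_lb_ip                       # default namespace assumption
#   envoy_lb_ip my-ingress-namespace  # explicit namespace
```

An empty result would suggest that the load balancer has not yet allocated an address to the service.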

4. Install External-DNS

The External-DNS package will integrate the TKG cluster with our external DNS source. This means that the FQDNs that we choose for applications such as Prometheus and Grafana will be automatically added to our external DNS source. To integrate with Microsoft DNS, the RFC2136 provider is chosen. This allows any RFC2136-compatible DNS server, such as Microsoft DNS, to be used as a provider for External-DNS. As mentioned, I am integrating with my rainpole.com domain. Note that External-DNS only supports Microsoft DNS via insecure updates, thus the inclusion of the --rfc2136-insecure argument (to support insecure dynamic updates) and the --rfc2136-tsig-axfr argument (to support zone transfers). Note also the use of the TXT registry. As per this note, when attempting to use a CNAME with a TXT registry, --txt-prefix must be set to avoid records using the same name. This External-DNS is configured for Service, Ingress and HTTPProxy sources. All of these settings are placed in a values file and passed in as the package is installed.

$ cat external-dns.yaml
namespace: tanzu-system-service-discovery
deployment:
  args:
  - --registry=txt
  - --txt-prefix=external-dns-
  - --txt-owner-id=tanzu
  - --provider=rfc2136
  - --rfc2136-host=xx.xx.51.252
  - --rfc2136-port=53
  - --rfc2136-zone=rainpole.com
  - --rfc2136-insecure
  - --rfc2136-tsig-axfr
  - --source=service
  - --source=contour-httpproxy
  - --source=ingress
  - --domain-filter=rainpole.com
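
A missing argument in this file tends to show up only later as records that are never created or never deleted, so a quick check of the values file before installing can save some time. The sketch below simply looks for the arguments that this post relies on for Microsoft DNS. The helper name check_rfc2136_args is made up for the example, and the list of flags is my own selection rather than an official requirement.

```shell
# Report any of the Microsoft DNS related arguments that are missing
# from the values file given in $1. Returns non-zero if any are missing.
check_rfc2136_args() {
  file="$1"
  missing=0
  for flag in --provider=rfc2136 --rfc2136-insecure --rfc2136-tsig-axfr --txt-prefix=; do
    # -F treats the flag as a fixed string, -e protects the leading dashes
    if ! grep -qF -e "$flag" "$file"; then
      echo "missing: $flag"
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical usage:
#   check_rfc2136_args external-dns.yaml && echo "values file looks complete"
```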


$ tanzu package available list external-dns.tanzu.vmware.com
/ Retrieving package versions for external-dns.tanzu.vmware.com...
  NAME                           VERSION               RELEASED-AT
  external-dns.tanzu.vmware.com  0.8.0+vmware.1-tkg.1  2021-06-11T18:00:00Z


$ tanzu package install external-dns -p external-dns.tanzu.vmware.com -v 0.8.0+vmware.1-tkg.1 --values-file external-dns.yaml
\ Installing package 'external-dns.tanzu.vmware.com'
| Getting namespace 'default'
| Getting package metadata for 'external-dns.tanzu.vmware.com'
| Creating service account 'external-dns-default-sa'
| Creating cluster admin role 'external-dns-default-cluster-role'
| Creating cluster role binding 'external-dns-default-cluster-rolebinding'
| Creating secret 'external-dns-default-values'
- Creating package resource
/ Package install status: Reconciling

 Added installed package 'external-dns' in namespace 'default'


$ tanzu package installed list
- Retrieving installed packages...
  NAME          PACKAGE-NAME                   PACKAGE-VERSION        STATUS
  cert-manager  cert-manager.tanzu.vmware.com  1.1.0+vmware.1-tkg.2   Reconcile succeeded
  contour       contour.tanzu.vmware.com       1.17.1+vmware.1-tkg.1  Reconcile succeeded
  external-dns  external-dns.tanzu.vmware.com  0.8.0+vmware.1-tkg.1   Reconcile succeeded


$ kubectl get apps
NAME           DESCRIPTION           SINCE-DEPLOY   AGE
cert-manager   Reconcile succeeded   44s            9m41s
contour        Reconcile succeeded   58s            5m3s
external-dns   Reconcile succeeded   30s            92s

5. Verify External-DNS Deployment

Check the logs of the external-dns pod located in the tanzu-system-service-discovery namespace. Ensure that the “Configured RFC2136 with zone” message appears in the logs.

$ kubectl get pods -n tanzu-system-service-discovery
NAME                           READY   STATUS    RESTARTS   AGE
external-dns-c59745fc6-xhzzj   1/1     Running   0          8m33s


$ kubectl logs external-dns-c59745fc6-xhzzj -n tanzu-system-service-discovery
.
.
time="2021-12-07T09:01:42Z" level=info msg="Instantiating new Kubernetes client"
time="2021-12-07T09:01:42Z" level=info msg="Using inCluster-config based on serviceaccount-token"
time="2021-12-07T09:01:42Z" level=info msg="Created Kubernetes client https://100.64.0.1:443"
time="2021-12-07T09:01:43Z" level=info msg="Created Dynamic Kubernetes client https://100.64.0.1:443"
time="2021-12-07T09:01:45Z" level=info msg="Configured RFC2136 with zone 'rainpole.com.' and nameserver '10.27.51.252:53'"
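
Rather than reading through the whole log, the zone and nameserver can be pulled out of that one message. This is a small sketch that assumes the message format shown above. The helper name rfc2136_zone_from_log is made up, and the deployment name external-dns in the usage line is inferred from the pod name rather than confirmed.

```shell
# Read external-dns log lines on stdin and print "<zone> <nameserver>"
# for every "Configured RFC2136 with zone" message found.
rfc2136_zone_from_log() {
  sed -n "s/.*Configured RFC2136 with zone '\([^']*\)' and nameserver '\([^']*\)'.*/\1 \2/p"
}

# Hypothetical usage (deployment name is an assumption):
#   kubectl logs deployment/external-dns -n tanzu-system-service-discovery | rfc2136_zone_from_log
```

No output at all would mean the message never appeared, which points at the provider arguments in the values file.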

6. Deploy Prometheus

The next step is to deploy Prometheus, which will record real-time metrics from the TKG cluster in a time-series database. In this setup, Prometheus is configured to enable the use of an Ingress (or rather an HTTPProxy), and is also provided with an FQDN that is part of this DNS domain, prometheus.rainpole.com. These settings are included in the Prometheus values file. If all is working after the deployment, we should be able to access the Prometheus dashboard using the FQDN, and resolve it using tools such as nslookup.

$ cat prometheus.yaml
ingress:
  enabled: true
  virtual_host_fqdn: "prometheus.rainpole.com"
  prometheus_prefix: "/"
  alertmanager_prefix: "/alertmanager/"
  prometheusServicePort: 80
  alertmanagerServicePort: 80


$ tanzu package available list prometheus.tanzu.vmware.com
- Retrieving package versions for prometheus.tanzu.vmware.com...
  NAME                         VERSION                RELEASED-AT
  prometheus.tanzu.vmware.com  2.27.0+vmware.1-tkg.1  2021-05-12T18:00:00Z


$ tanzu package install prometheus --package-name prometheus.tanzu.vmware.com --version 2.27.0+vmware.1-tkg.1 --values-file prometheus.yaml
| Installing package 'prometheus.tanzu.vmware.com'
- Installing package 'prometheus.tanzu.vmware.com'
| Getting namespace 'default'
| Getting package metadata for 'prometheus.tanzu.vmware.com'
| Creating service account 'prometheus-default-sa'
| Creating cluster admin role 'prometheus-default-cluster-role'
| Creating cluster role binding 'prometheus-default-cluster-rolebinding'
| Creating secret 'prometheus-default-values'
- Creating package resource
\ Package install status: Reconciling

 Added installed package 'prometheus' in namespace 'default'


$ tanzu package installed list
/ Retrieving installed packages...
  NAME          PACKAGE-NAME                   PACKAGE-VERSION        STATUS
  cert-manager  cert-manager.tanzu.vmware.com  1.1.0+vmware.1-tkg.2   Reconcile succeeded
  contour       contour.tanzu.vmware.com       1.17.1+vmware.1-tkg.1  Reconcile succeeded
  external-dns  external-dns.tanzu.vmware.com  0.8.0+vmware.1-tkg.1   Reconcile succeeded
  prometheus    prometheus.tanzu.vmware.com    2.27.0+vmware.1-tkg.1  Reconcile succeeded


$ kubectl get apps
NAME          DESCRIPTION          SINCE-DEPLOY  AGE
cert-manager  Reconcile succeeded  56s           53m
contour       Reconcile succeeded  25s           48m
external-dns  Reconcile succeeded  35s           45m
prometheus    Reconcile succeeded  104s          108s


$ kubectl get httpproxy -A
NAMESPACE                NAME                  FQDN                     TLS SECRET      STATUS  STATUS DESCRIPTION
tanzu-system-monitoring  prometheus-httpproxy  prometheus.rainpole.com  prometheus-tls  valid   Valid HTTPProxy


$ nslookup prometheus.rainpole.com
Server: 127.0.0.53
Address: 127.0.0.53#53

Non-authoritative answer:
Name: prometheus.rainpole.com
Address: xx.xx.62.25
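
The nslookup output mixes the resolver's own address with the answer, which is awkward when the check is scripted. The sketch below keeps only the addresses from the answer section, assuming the output layout shown above. The helper name answer_addresses is made up for this example.

```shell
# Read nslookup output on stdin and print only the addresses that
# follow a "Name:" line, skipping the resolver's own "Address:" line.
answer_addresses() {
  awk '
    /^Name:/ { in_answer = 1; next }
    in_answer && /^Address:/ { print $2 }
  '
}

# Hypothetical usage:
#   nslookup prometheus.rainpole.com | answer_addresses
```

The printed address should match the external IP of the Envoy LoadBalancer service, since the HTTPProxy is served through Envoy.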

7. Verify Prometheus DNS records added

If the nslookup succeeds, then the external DNS has been successfully updated. It should be possible to see the DNS records get updated via the logs of the external-dns pod. The --txt-prefix set in the configuration has ensured that we get two different names for the A and TXT records, and that there is no clash of names.

time="2021-12-07T09:46:10Z" level=info msg="Adding RR: prometheus.rainpole.com 0 A xx.xx.62.25"
time="2021-12-07T09:46:10Z" level=info msg="Adding RR: external-dns-prometheus.rainpole.com 0 TXT \"heritage=external-dns,external-dns/owner=tanzu,external-dns/resource=HTTPProxy/tanzu-system-monitoring/prometheus-httpproxy\""
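
To make the effect of --txt-prefix concrete, the sketch below derives the TXT record name from the prefix and the FQDN, and lists the name and type of each record that External-DNS reports adding. It assumes the "Adding RR" log format shown above. Both helper names are made up for this example.

```shell
# Build the TXT registry record name: the prefix is placed in front of
# the DNS name, so it no longer collides with the A record.
# $1 = txt prefix, $2 = FQDN
txt_record_name() {
  printf '%s%s\n' "$1" "$2"
}

# Read external-dns log lines on stdin and print "<name> <type>" for
# every record that was added.
added_records() {
  sed -n 's/.*msg="Adding RR: \([^ ]*\) [0-9][0-9]* \([A-Z][A-Z]*\) .*/\1 \2/p'
}

# Hypothetical usage:
#   txt_record_name external-dns- prometheus.rainpole.com
#   kubectl logs external-dns-c59745fc6-xhzzj -n tanzu-system-service-discovery | added_records
```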

It should now be possible to see the DNS A and TXT records in the Microsoft DNS. At this point, it should also be possible to connect to the Prometheus dashboard using the FQDN, in my example prometheus.rainpole.com.
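
The same connectivity check can be done from the command line. Prometheus exposes a readiness endpoint at /-/ready, which sits at the root here because prometheus_prefix is "/". The sketch below assumes the HTTPProxy serves HTTPS with a certificate that the client does not trust, hence the -k option. Drop that option if your certificate chain is trusted. The helper name prometheus_ready is made up for this example.

```shell
# Succeed if the Prometheus readiness endpoint behind the given FQDN
# answers with HTTP 200.
# $1 = Prometheus FQDN
prometheus_ready() {
  # -k skips certificate verification (assumes an untrusted cert)
  code=$(curl -ks -o /dev/null -w '%{http_code}' "https://$1/-/ready")
  [ "$code" = "200" ]
}

# Hypothetical usage:
#   prometheus_ready prometheus.rainpole.com && echo "Prometheus is ready"
```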

8. Deploy Grafana

We now come to the last part of the setup of a monitoring stack on TKG, and that is the deployment of Grafana. Grafana provides charts, graphs, and alerts when connected to a supported data source. Through the tanzu package mechanism, we can connect Grafana directly to the Prometheus data source configured previously. We will also configure it to use an HTTPProxy and provide the FQDN so that it is automatically added to our external DNS.

$ cat grafana.yaml
grafana:
  config:
    datasource_yaml: |-
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: prometheus-server.tanzu-system-monitoring.svc.cluster.local
          access: proxy
          isDefault: true
namespace: tanzu-system-dashboard
ingress:
  virtual_host_fqdn: "grafana.rainpole.com"
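
The url in the data source is the in-cluster DNS name of the Prometheus service, which follows the usual Kubernetes pattern of service name, namespace, svc, and cluster domain. The sketch below just assembles that name, assuming the default cluster domain of cluster.local. The helper name svc_dns is made up for this example.

```shell
# Build the in-cluster DNS name of a Kubernetes service.
# $1 = service name, $2 = namespace, $3 = cluster domain (default: cluster.local)
svc_dns() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Hypothetical usage:
#   svc_dns prometheus-server tanzu-system-monitoring
```

If Prometheus were deployed to a different namespace, the url in grafana.yaml would need to change accordingly.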


$ tanzu package available list grafana.tanzu.vmware.com
| Retrieving package versions for grafana.tanzu.vmware.com...
  NAME                      VERSION               RELEASED-AT
  grafana.tanzu.vmware.com  7.5.7+vmware.1-tkg.1  2021-05-19T18:00:00Z


$ tanzu package install grafana -p grafana.tanzu.vmware.com -v 7.5.7+vmware.1-tkg.1 --values-file grafana.yaml
- Installing package 'grafana.tanzu.vmware.com'
| Getting namespace 'default'
| Getting package metadata for 'grafana.tanzu.vmware.com'
| Creating service account 'grafana-default-sa'
| Creating cluster admin role 'grafana-default-cluster-role'
| Creating cluster role binding 'grafana-default-cluster-rolebinding'
| Creating secret 'grafana-default-values'
- Creating package resource
\ Package install status: Reconciling

 Added installed package 'grafana' in namespace 'default'


$ tanzu package installed list
/ Retrieving installed packages...
  NAME          PACKAGE-NAME                   PACKAGE-VERSION        STATUS
  cert-manager  cert-manager.tanzu.vmware.com  1.1.0+vmware.1-tkg.2   Reconcile succeeded
  contour       contour.tanzu.vmware.com       1.17.1+vmware.1-tkg.1  Reconcile succeeded
  external-dns  external-dns.tanzu.vmware.com  0.8.0+vmware.1-tkg.1   Reconcile succeeded
  grafana       grafana.tanzu.vmware.com       7.5.7+vmware.1-tkg.1   Reconcile succeeded
  prometheus    prometheus.tanzu.vmware.com    2.27.0+vmware.1-tkg.1  Reconcile succeeded


$ kubectl get apps
NAME          DESCRIPTION          SINCE-DEPLOY  AGE
cert-manager  Reconcile succeeded  73s           76m
contour       Reconcile succeeded  41s           71m
external-dns  Reconcile succeeded  41s           68m
grafana       Reconcile succeeded  43s           3m6s
prometheus    Reconcile succeeded  21s           24m


$ kubectl get httpproxy -A
NAMESPACE                NAME                  FQDN                     TLS SECRET      STATUS  STATUS DESCRIPTION
tanzu-system-dashboard   grafana-httpproxy     grafana.rainpole.com     grafana-tls     valid   Valid HTTPProxy
tanzu-system-monitoring  prometheus-httpproxy  prometheus.rainpole.com  prometheus-tls  valid   Valid HTTPProxy


$ nslookup grafana.rainpole.com
Server: 127.0.0.53
Address: 127.0.0.53#53

Non-authoritative answer:
Name: grafana.rainpole.com
Address: xx.xx.62.25

9. Verify Grafana DNS records added

Once again, if the nslookup succeeds, then the external DNS has been successfully updated. It should once more be possible to see the DNS records get updated via the logs of the external-dns pod.

time="2021-12-07T10:07:23Z" level=info msg="Adding RR: grafana.rainpole.com 0 A 10.27.62.25"
time="2021-12-07T10:07:23Z" level=info msg="Adding RR: external-dns-grafana.rainpole.com 0 TXT \"heritage=external-dns,external-dns/owner=tanzu,external-dns/resource=HTTPProxy/tanzu-system-dashboard/grafana-httpproxy\""

10. Access the Grafana Dashboard

Grafana should now be operational. Open a browser and point it at the Grafana FQDN, in my case grafana.rainpole.com. You should see the Grafana login page appear. The default credentials are admin/admin, but you will be prompted to provide a new password after the initial login. From the left-hand menu, select Dashboards, then Manage. Two dashboards should be available for selection: Kubernetes / API server and TKG Kubernetes cluster monitoring (via Prometheus). Select the latter, and you should begin to see some K8s metrics visualized, as shown below.
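
If the browser is not convenient, Grafana also has a health endpoint at /api/health that returns a small JSON document. The sketch below checks that document for a healthy database entry. It assumes the response contains a "database": "ok" field, and that the certificate presented by the HTTPProxy may not be trusted by the client, hence the -k option in the usage line. The helper name grafana_healthy is made up for this example.

```shell
# Read the JSON returned by Grafana's /api/health endpoint on stdin and
# succeed only if it reports the database as "ok".
grafana_healthy() {
  grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Hypothetical usage (-k skips certificate verification):
#   curl -ks https://grafana.rainpole.com/api/health | grafana_healthy && echo "Grafana is healthy"
```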

Conclusion

That completes the setup. You now have Prometheus and Grafana working together to provide insights into your TKG cluster. Hopefully this has given you a good idea of the power and simplicity of the Carvel packages available in TKG. You could of course do more bespoke configurations for each of the packages, but the purpose of this post was just to get you up and running as quickly as possible with the monitoring packages. Hope you found it useful.