In the vSphere CSI controller pod, two containers expose metrics. The first is the vsphere-csi-controller container, which handles communication from the Kubernetes API server to the CNS component on vCenter Server for volume lifecycle operations. The second is the vsphere-syncer container, which sends metadata about persistent volumes back to the CNS component on vCenter Server so that it can be displayed in the vSphere Client UI in the Container Volumes view. The vsphere-csi-controller container exposes Prometheus metrics on port 2112, while the vsphere-syncer container exposes Prometheus metrics on port 2113. The full list of metrics exposed by the CSI driver is available in the official docs. We can also check this on our cluster.
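Before wiring up Prometheus, you can sanity-check these endpoints yourself. The sketch below port-forwards the vsphere-csi-controller service and curls both metrics ports; it assumes the service name, namespace, and ports shown in the kubectl output below, which may differ in other driver versions.

```shell
# Sketch: fetch the raw metrics locally (assumes the vsphere-csi-controller
# service in the vmware-system-csi namespace, as shown in the output below).
kubectl -n vmware-system-csi port-forward svc/vsphere-csi-controller 2112 2113 &
sleep 2   # give the port-forward a moment to establish
curl -s http://localhost:2112/metrics | head   # vsphere-csi-controller metrics
curl -s http://localhost:2113/metrics | head   # vsphere-syncer metrics
kill %1   # stop the port-forward when done
```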
% kubectl get svc -n vmware-system-csi
NAME                     TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
vsphere-csi-controller   ClusterIP   10.107.176.153   <none>        2112/TCP,2113/TCP   26d
There are multiple ways to deploy a Prometheus monitoring stack. There is a Prometheus Operator, kube-prometheus, Helm charts maintained by the Prometheus community, and if using TKGm, there is the Carvel tools approach which I have blogged about previously. Since this is a vanilla, upstream K8s cluster where the vSphere CSI driver v2.5 is deployed, I will use the kube-prometheus approach and follow the quickstart guide to very quickly stand up a Prometheus monitoring stack which includes AlertManager and Grafana.
Step 1. Clone the kube-prometheus repository from GitHub
% git clone https://github.com/prometheus-operator/kube-prometheus
Cloning into 'kube-prometheus'...
remote: Enumerating objects: 15523, done.
remote: Counting objects: 100% (209/209), done.
remote: Compressing objects: 100% (119/119), done.
remote: Total 15523 (delta 126), reused 123 (delta 78), pack-reused 15314
Receiving objects: 100% (15523/15523), 7.79 MiB | 542.00 KiB/s, done.
Resolving deltas: 100% (9884/9884), done.
% cd kube-prometheus
% ls
CHANGELOG.md          experimental
CONTRIBUTING.md       go.mod
DCO                   go.sum
LICENSE               jsonnet
Makefile              jsonnetfile.json
README.md             jsonnetfile.lock.json
RELEASE.md            kubescape-exceptions.json
build.sh              kustomization.yaml
code-of-conduct.md    manifests
developer-workspace   scripts
docs                  sync-to-internal-registry.jsonnet
example.jsonnet       tests
examples
%
Step 2. Apply the kube-prometheus manifests
This is done in two steps: first create the Custom Resource Definitions (CRDs) used by the Prometheus stack, then deploy the Prometheus stack objects themselves.
% kubectl apply --server-side -f manifests/setup
customresourcedefinition.apiextensions.k8s.io/alertmanagerconfigs.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/probes.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com serverside-applied
namespace/monitoring serverside-applied

% kubectl get crd
NAME                                             CREATED AT
alertmanagerconfigs.monitoring.coreos.com        2022-03-09T09:16:24Z
alertmanagers.monitoring.coreos.com              2022-03-09T09:16:25Z
backups.velero.io                                2022-02-18T11:12:17Z
backupstoragelocations.velero.io                 2022-02-18T11:12:17Z
cnsvolumeoperationrequests.cns.vmware.com        2022-02-10T11:38:35Z
csinodetopologies.cns.vmware.com                 2022-02-10T11:38:55Z
deletebackuprequests.velero.io                   2022-02-18T11:12:17Z
downloadrequests.velero.io                       2022-02-18T11:12:17Z
podmonitors.monitoring.coreos.com                2022-03-09T09:16:25Z
podvolumebackups.velero.io                       2022-02-18T11:12:17Z
podvolumerestores.velero.io                      2022-02-18T11:12:17Z
probes.monitoring.coreos.com                     2022-03-09T09:16:25Z
prometheuses.monitoring.coreos.com               2022-03-09T09:16:26Z
prometheusrules.monitoring.coreos.com            2022-03-09T09:16:27Z
resticrepositories.velero.io                     2022-02-18T11:12:17Z
restores.velero.io                               2022-02-18T11:12:17Z
schedules.velero.io                              2022-02-18T11:12:18Z
serverstatusrequests.velero.io                   2022-02-18T11:12:18Z
servicemonitors.monitoring.coreos.com            2022-03-09T09:16:27Z
thanosrulers.monitoring.coreos.com               2022-03-09T09:16:27Z
volumesnapshotclasses.snapshot.storage.k8s.io    2022-02-10T11:48:15Z
volumesnapshotcontents.snapshot.storage.k8s.io   2022-02-10T11:48:16Z
volumesnapshotlocations.velero.io                2022-02-18T11:12:18Z
volumesnapshots.snapshot.storage.k8s.io          2022-02-10T11:48:17Z
A significant number of Kubernetes objects are created when the contents of the manifests folder are deployed.
% kubectl apply -f manifests/
alertmanager.monitoring.coreos.com/main created
poddisruptionbudget.policy/alertmanager-main created
prometheusrule.monitoring.coreos.com/alertmanager-main-rules created
secret/alertmanager-main created
service/alertmanager-main created
serviceaccount/alertmanager-main created
servicemonitor.monitoring.coreos.com/alertmanager-main created
clusterrole.rbac.authorization.k8s.io/blackbox-exporter created
clusterrolebinding.rbac.authorization.k8s.io/blackbox-exporter created
configmap/blackbox-exporter-configuration created
deployment.apps/blackbox-exporter created
service/blackbox-exporter created
serviceaccount/blackbox-exporter created
servicemonitor.monitoring.coreos.com/blackbox-exporter created
secret/grafana-config created
secret/grafana-datasources created
configmap/grafana-dashboard-alertmanager-overview created
configmap/grafana-dashboard-apiserver created
configmap/grafana-dashboard-cluster-total created
configmap/grafana-dashboard-controller-manager created
configmap/grafana-dashboard-grafana-overview created
configmap/grafana-dashboard-k8s-resources-cluster created
configmap/grafana-dashboard-k8s-resources-namespace created
configmap/grafana-dashboard-k8s-resources-node created
configmap/grafana-dashboard-k8s-resources-pod created
configmap/grafana-dashboard-k8s-resources-workload created
configmap/grafana-dashboard-k8s-resources-workloads-namespace created
configmap/grafana-dashboard-kubelet created
configmap/grafana-dashboard-namespace-by-pod created
configmap/grafana-dashboard-namespace-by-workload created
configmap/grafana-dashboard-node-cluster-rsrc-use created
configmap/grafana-dashboard-node-rsrc-use created
configmap/grafana-dashboard-nodes created
configmap/grafana-dashboard-persistentvolumesusage created
configmap/grafana-dashboard-pod-total created
configmap/grafana-dashboard-prometheus-remote-write created
configmap/grafana-dashboard-prometheus created
configmap/grafana-dashboard-proxy created
configmap/grafana-dashboard-scheduler created
configmap/grafana-dashboard-workload-total created
configmap/grafana-dashboards created
deployment.apps/grafana created
prometheusrule.monitoring.coreos.com/grafana-rules created
service/grafana created
serviceaccount/grafana created
servicemonitor.monitoring.coreos.com/grafana created
prometheusrule.monitoring.coreos.com/kube-prometheus-rules created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
deployment.apps/kube-state-metrics created
prometheusrule.monitoring.coreos.com/kube-state-metrics-rules created
service/kube-state-metrics created
serviceaccount/kube-state-metrics created
servicemonitor.monitoring.coreos.com/kube-state-metrics created
prometheusrule.monitoring.coreos.com/kubernetes-monitoring-rules created
servicemonitor.monitoring.coreos.com/kube-apiserver created
servicemonitor.monitoring.coreos.com/coredns created
servicemonitor.monitoring.coreos.com/kube-controller-manager created
servicemonitor.monitoring.coreos.com/kube-scheduler created
servicemonitor.monitoring.coreos.com/kubelet created
clusterrole.rbac.authorization.k8s.io/node-exporter created
clusterrolebinding.rbac.authorization.k8s.io/node-exporter created
daemonset.apps/node-exporter created
prometheusrule.monitoring.coreos.com/node-exporter-rules created
service/node-exporter created
serviceaccount/node-exporter created
servicemonitor.monitoring.coreos.com/node-exporter created
clusterrole.rbac.authorization.k8s.io/prometheus-k8s created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-k8s created
poddisruptionbudget.policy/prometheus-k8s created
prometheus.monitoring.coreos.com/k8s created
prometheusrule.monitoring.coreos.com/prometheus-k8s-prometheus-rules created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s-config created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s-config created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
service/prometheus-k8s created
serviceaccount/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus-k8s created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
clusterrole.rbac.authorization.k8s.io/prometheus-adapter created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-adapter created
clusterrolebinding.rbac.authorization.k8s.io/resource-metrics:system:auth-delegator created
clusterrole.rbac.authorization.k8s.io/resource-metrics-server-resources created
configmap/adapter-config created
deployment.apps/prometheus-adapter created
poddisruptionbudget.policy/prometheus-adapter created
rolebinding.rbac.authorization.k8s.io/resource-metrics-auth-reader created
service/prometheus-adapter created
serviceaccount/prometheus-adapter created
servicemonitor.monitoring.coreos.com/prometheus-adapter created
clusterrole.rbac.authorization.k8s.io/prometheus-operator created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-operator created
deployment.apps/prometheus-operator created
prometheusrule.monitoring.coreos.com/prometheus-operator-rules created
service/prometheus-operator created
serviceaccount/prometheus-operator created
servicemonitor.monitoring.coreos.com/prometheus-operator created

% kubectl get servicemonitors -A
NAMESPACE    NAME                      AGE
monitoring   alertmanager-main         46s
monitoring   blackbox-exporter         45s
monitoring   coredns                   25s
monitoring   grafana                   27s
monitoring   kube-apiserver            25s
monitoring   kube-controller-manager   24s
monitoring   kube-scheduler            24s
monitoring   kube-state-metrics        26s
monitoring   kubelet                   24s
monitoring   node-exporter             23s
monitoring   prometheus-adapter        17s
monitoring   prometheus-k8s            20s
monitoring   prometheus-operator       16s

% kubectl get pod -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2     Running   0          95s
alertmanager-main-1                    2/2     Running   0          95s
alertmanager-main-2                    2/2     Running   0          95s
blackbox-exporter-7d89b9b799-svr4t     3/3     Running   0          2m5s
grafana-5577bc8799-b5bnd               1/1     Running   0          107s
kube-state-metrics-d5754d6dc-spx4w     3/3     Running   0          106s
node-exporter-8b44z                    2/2     Running   0          103s
node-exporter-jrxrc                    2/2     Running   0          103s
node-exporter-pj7nb                    2/2     Running   0          103s
prometheus-adapter-6998fcc6b5-dlqk6    1/1     Running   0          97s
prometheus-adapter-6998fcc6b5-qswk4    1/1     Running   0          97s
prometheus-k8s-0                       2/2     Running   0          94s
prometheus-k8s-1                       2/2     Running   0          94s
prometheus-operator-59647c66cf-ldppj   2/2     Running   0          96s

% kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
alertmanager-main       ClusterIP   10.96.161.166   <none>        9093/TCP,8080/TCP            7m18s
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   6m47s
blackbox-exporter       ClusterIP   10.104.28.233   <none>        9115/TCP,19115/TCP           7m17s
grafana                 ClusterIP   10.97.77.202    <none>        3000/TCP                     7m
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP            6m58s
node-exporter           ClusterIP   None            <none>        9100/TCP                     6m55s
prometheus-adapter      ClusterIP   10.104.10.57    <none>        443/TCP                      6m50s
prometheus-k8s          ClusterIP   10.99.185.136   <none>        9090/TCP,8080/TCP            6m53s
prometheus-operated     ClusterIP   None            <none>        9090/TCP                     6m46s
prometheus-operator     ClusterIP   None            <none>        8443/TCP                     6m49s
Step 3. Adjust the ClusterRole prometheus-k8s
One necessary adjustment is to the prometheus-k8s ClusterRole. When deployed through kube-prometheus, it does not have the apiGroup resources and verb rules needed to pick up the vSphere CSI metrics, so it must be modified before proceeding any further. Below are the rules when the ClusterRole is first created, followed by a new manifest that updates the rules. Lastly, the updated ClusterRole is displayed.
% kubectl get ClusterRole prometheus-k8s -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"labels":{"app.kubernetes.io/component":"prometheus","app.kubernetes.io/instance":"k8s","app.kubernetes.io/name":"prometheus","app.kubernetes.io/part-of":"kube-prometheus","app.kubernetes.io/version":"2.33.4"},"name":"prometheus-k8s"},"rules":[{"apiGroups":[""],"resources":["nodes/metrics"],"verbs":["get"]},{"nonResourceURLs":["/metrics"],"verbs":["get"]}]}
  creationTimestamp: "2022-03-09T09:19:39Z"
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.33.4
  name: prometheus-k8s
  resourceVersion: "7283142"
  uid: e18f021c-3e6e-4162-98ca-bbf912b75b06
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get

% cat prometheus-clusterRole-updated.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.33.0
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs:
  - /metrics
  verbs:
  - get

% kubectl apply -f prometheus-clusterRole-updated.yaml
clusterrole.rbac.authorization.k8s.io/prometheus-k8s configured

% kubectl get ClusterRole prometheus-k8s -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"labels":{"app.kubernetes.io/component":"prometheus","app.kubernetes.io/instance":"k8s","app.kubernetes.io/name":"prometheus","app.kubernetes.io/part-of":"kube-prometheus","app.kubernetes.io/version":"2.33.0"},"name":"prometheus-k8s"},"rules":[{"apiGroups":[""],"resources":["nodes","services","endpoints","pods"],"verbs":["get","list","watch"]},{"nonResourceURLs":["/metrics"],"verbs":["get"]}]}
  creationTimestamp: "2022-03-09T09:19:39Z"
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.33.0
  name: prometheus-k8s
  resourceVersion: "7284231"
  uid: e18f021c-3e6e-4162-98ca-bbf912b75b06
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- nonResourceURLs:
  - /metrics
  verbs:
  - get
Step 4. Create Service Monitor
To monitor any service through Prometheus, such as the vSphere CSI driver, a ServiceMonitor object must be created. The following is the manifest for the ServiceMonitor object that will be used to monitor the "vsphere-csi-controller" service. The endpoints refer to ports 2112 (ctlr) and 2113 (syncer) respectively. Once deployed, the list of ServiceMonitors can be checked to ensure it is running.
% cat vsphere-csi-controller-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vsphere-csi-controller-prometheus-servicemonitor
  namespace: monitoring
  labels:
    name: vsphere-csi-controller-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: vsphere-csi-controller
  namespaceSelector:
    matchNames:
    - vmware-system-csi
  endpoints:
  - port: ctlr
  - port: syncer

% kubectl apply -f vsphere-csi-controller-service-monitor.yaml
servicemonitor.monitoring.coreos.com/vsphere-csi-controller-prometheus-servicemonitor created

% kubectl get servicemonitors -A
NAMESPACE    NAME                                               AGE
monitoring   alertmanager-main                                  9m32s
monitoring   blackbox-exporter                                  9m31s
monitoring   coredns                                            9m11s
monitoring   grafana                                            9m13s
monitoring   kube-apiserver                                     9m11s
monitoring   kube-controller-manager                            9m10s
monitoring   kube-scheduler                                     9m10s
monitoring   kube-state-metrics                                 9m12s
monitoring   kubelet                                            9m10s
monitoring   node-exporter                                      9m9s
monitoring   prometheus-adapter                                 9m3s
monitoring   prometheus-k8s                                     9m6s
monitoring   prometheus-operator                                9m2s
monitoring   vsphere-csi-controller-prometheus-servicemonitor   42s
At this point, it is a good idea to check the logs on the prometheus-k8s-* pods in the monitoring namespace. If there are issues scraping metrics from the vSphere CSI driver, they will appear here. If you have not correctly updated the ClusterRole as mentioned previously, you may observe errors similar to this:
ts=2022-03-07T15:15:06.580Z caller=klog.go:116 level=error component=k8s_client_runtime
func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167:
Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden:
User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\"
in API group \"\" in the namespace \"vmware-system-csi\""
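Rather than waiting for scrape errors like the one above, you can also check the Prometheus service account's permissions directly with kubectl auth can-i. A quick sketch, using the service account and namespace names from the error message:

```shell
# Should print "yes" once the updated ClusterRole from Step 3 is in place;
# "no" reproduces the forbidden error shown above.
kubectl auth can-i list endpoints \
  --as=system:serviceaccount:monitoring:prometheus-k8s \
  -n vmware-system-csi
```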
Step 5. Launch Prometheus UI
In step 2, after deploying the manifests, the list of services was displayed. You may have noticed that the prometheus-k8s service is of type ClusterIP. This means that it is an internal service and is not accessible externally.
% kubectl get svc prometheus-k8s -n monitoring
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
prometheus-k8s   ClusterIP   10.99.185.136   <none>        9090/TCP,8080/TCP   97m
There are various ways to address this, such as changing the service type to NodePort, or to LoadBalancer if you have something available to provide load balancer IP addresses. The easiest way, for the purposes of our testing, is to simply port-forward the service port (9090) and make it accessible from the local host.
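For completeness, switching the service type instead of port-forwarding is a one-line patch. A sketch (this modifies the live service, so it is better suited to test clusters):

```shell
# Expose prometheus-k8s on a node port instead of port-forwarding.
kubectl -n monitoring patch svc prometheus-k8s -p '{"spec":{"type":"NodePort"}}'
kubectl -n monitoring get svc prometheus-k8s   # note the allocated node port
```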
% kubectl --namespace monitoring port-forward svc/prometheus-k8s 9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090
Now you can open a browser on your desktop and connect to http://localhost:9090 to see the Prometheus UI. Here we can check if we are getting metrics from the vSphere CSI driver, as described in the official docs. We should be able to see metrics called vsphere_csi_info and vsphere_syncer_info, among others. Simply type the name into the query field and see if it is visible. Click on the Execute button to see further info. This is the controller info.
This is the syncer info.
This all looks good so we can proceed with launching the Grafana portal and creating a dashboard to display the metrics.
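For reference, what Prometheus actually scrapes from ports 2112 and 2113 is the plain-text exposition format: one name{labels} value sample per line. The snippet below uses a hypothetical sample (the exact labels on vsphere_csi_info vary by driver version) to show how a single metric can be picked out of the scrape output:

```shell
# Hypothetical sample of the Prometheus text exposition format; the real
# labels on vsphere_csi_info depend on the CSI driver version.
cat <<'EOF' > /tmp/sample-metrics.txt
# HELP vsphere_csi_info CSI Info
# TYPE vsphere_csi_info gauge
vsphere_csi_info{version="v2.5.0"} 1
EOF
# Lines starting with '#' are HELP/TYPE comments; grep the sample itself.
grep '^vsphere_csi_info' /tmp/sample-metrics.txt
```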
Step 6. Launch Grafana UI, create dashboard
The Grafana UI can be accessed in much the same way as the Prometheus UI. Again, it has been deployed with a ClusterIP type service, so it is not accessible outside of the Kubernetes cluster. We can once again use the port-forward functionality to access it from a browser on the local host. This time the port is 3000.
% kubectl get svc grafana -n monitoring
NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
grafana   ClusterIP   10.97.77.202   <none>        3000/TCP   98m

% kubectl --namespace monitoring port-forward svc/grafana 3000
Forwarding from 127.0.0.1:3000 -> 3000
Forwarding from [::1]:3000 -> 3000
We can now access the Grafana UI via http://localhost:3000. This is the initial login screen. The default username and password are admin / admin.
On initial login, you are asked to provide a new password. You can set a new password or choose to skip this step.
The next step is to create a dashboard for the vSphere CSI driver metrics. The good news is that Liping has already created some sample Grafana dashboards for us to use, and these are available on GitHub here. The vSphere CSI dashboard shows metrics for CSI operations, whilst the vSphere CSI-CNS dashboard shows metrics for CNS operations observed at the CSI layer. You can use these dashboards as building blocks to create your own bespoke dashboards, should you so wish. Once logged in, click on the + sign on the left-hand side of the Grafana UI, and then select Import.
This opens a wizard that allows you to import a dashboard directly from Grafana using the dashboard URL or ID, or to copy and paste JSON contents for a dashboard. We can copy and paste the raw JSON from Liping’s dashboards on GitHub into the panel, as shown below. Once the JSON contents are pasted, click Load.
After loading, the only remaining task is to set the Prometheus data source. Simply select prometheus from the dropdown list (there should only be one), and click Import.
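If you prefer to script the import rather than click through the wizard, Grafana also exposes an HTTP API for creating dashboards. A sketch, assuming the port-forward on localhost:3000 is still active, the default admin/admin credentials are in use, and the dashboard JSON has been saved locally as dashboard.json (a hypothetical filename); exported dashboards that use templated data source variables may need those resolved first:

```shell
# Push a dashboard definition through Grafana's HTTP API instead of the UI
# wizard. dashboard.json is a locally saved copy of the dashboard JSON.
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:3000/api/dashboards/db \
  -d "{\"dashboard\": $(cat dashboard.json), \"overwrite\": true}"
```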
And now you should begin to see the vSphere CSI driver metrics that have been scraped and stored by Prometheus displayed in the Grafana dashboard. Leave it running for a while and you should begin to see some graphs populating, similar to the dashboard below.
And that completes the setup. You can now observe various vSphere CSI driver metrics appearing in the Grafana dashboards. Kudos once again to Liping Xue for doing the groundwork and documenting how to stand up an environment to demonstrate the metrics feature in version 2.5. The manifests used to create the correct ClusterRole and the Prometheus service monitor can be found here.