Wavefront Collector Issues: Error in scraping containers

I was very pleased last week, as I managed to get a bunch of metrics sent from my Kubernetes cluster into Wavefront by chaining proxies together. I was successfully able to see my cluster’s Kube-state Metrics and Kubernetes Collector Metrics in Wavefront. However, on closer inspection, I noticed that a number of the built-in Wavefront Kubernetes dashboards were not being populated (Kubernetes Metrics and Kubernetes Metrics by Namespace), and then I found a number of errors in the Wavefront collector logs in my deployment. This post will describe what these errors were, and how I rectified them.

There were two distinct errors related to scraping containers (i.e. gathering metrics from containers). First, there were the errors related to the kubelet (the Kubernetes agent that runs on each node). I had one of these errors for each of the nodes in the Kubernetes cluster, three in my case. I was able to view these errors by displaying the logs of the Wavefront Collector Pod via kubectl logs.
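
Something along these lines will surface them; treat the namespace and Pod name as placeholders, since they depend on how the collector was deployed in your cluster:

$ kubectl logs -n wavefront-collector <wavefront-collector-pod> | grep "Error in scraping"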

E0811 09:01:05.002411       1 manager.go:124] Error in scraping containers from \
kubelet_summary:192.168.192.5:10255: Get http://192.168.192.5:10255/stats/summary/: \
dial tcp 192.168.192.5:10255: connect: connection refused
E0811 09:01:05.002573       1 manager.go:124] Error in scraping containers from \
kubelet_summary:192.168.192.3:10255: Get http://192.168.192.3:10255/stats/summary/: \
dial tcp 192.168.192.3:10255: connect: connection refused
E0811 09:01:05.032201       1 manager.go:124] Error in scraping containers from \
kubelet_summary:192.168.192.4:10255: Get http://192.168.192.4:10255/stats/summary/: \
dial tcp 192.168.192.4:10255: connect: connection refused

There was a second error observed as well. This one was against the kube-dns service (which corresponds to the 10.100.200.2 cluster IP address):

E0811 09:01:05.008521       1 manager.go:124] Error in scraping containers from \
prometheus_source: http://10.100.200.2:9153/metrics: Get http://10.100.200.2:9153/metrics: \
dial tcp 10.100.200.2:9153: connect: network is unreachable
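
As an aside, to confirm which Service a cluster IP such as this belongs to, a quick filter of the Service list across all namespaces does the job (the IP is obviously specific to my cluster):

$ kubectl get svc --all-namespaces -o wide | grep 10.100.200.2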

Thus, there were two distinct problems. My Kubernetes nodes were refusing the Wavefront collector’s connections on port 10255 (connection refused), and when the collector tried to reach the kube-dns metrics port 9153, the network was simply unreachable.

Let’s concentrate on the kubelet issue first. This appears to be a common enough issue where kubelets do not allow metrics to be retrieved on port 10255. I found a discussion online which suggested that the kubelets need to be started with --read-only-port=10255 on the nodes. The workaround I chose instead was to have the collector use the secure HTTPS kubelet port 10250 rather than the read-only HTTP port 10255. To do that, the following change was made to the Wavefront collector YAML file:

from:
  - --source=kubernetes.summary_api:''

to:
  - --source=kubernetes.summary_api:''?kubeletHttps=true&kubeletPort=10250&insecure=true
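
For reference, that flag lives in the container args of the collector Deployment. The snippet below is only a sketch to show where the change goes; the image tag, entrypoint and remaining flags (such as the Wavefront sink) are placeholders here and should be left exactly as they already are in your manifest:

    spec:
      containers:
      - name: wavefront-collector
        image: wavefront/wavefront-kubernetes-collector:<your-existing-tag>
        command:
        - /wavefront-collector   # or whatever entrypoint your manifest already uses
        - --source=kubernetes.summary_api:''?kubeletHttps=true&kubeletPort=10250&insecure=true
        # ... keep your existing --sink and any other flags unchanged ...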

This now allows metrics to be retrieved from the nodes. Let’s now look at the kube-dns issue. I found the solution to that issue in this online discussion. It seems that the Wavefront collector is configured to scrape a port named metrics (9153) from CoreDNS, but the kube-dns service does NOT have this port configured. By editing the kube-dns service and adding the port, the issue was addressed. I’m not sure if this configuration, whereby the port is configured on the Pod but not on the Service, is a nuance of PKS, since I am using PKS to deploy my K8s clusters. When editing the service, simply add the new port to the ports section of the manifest, as shown in the before and after outputs below (a non-interactive kubectl patch alternative is sketched after the outputs).

$ kubectl get svc -n kube-system kube-dns -o json | jq .spec.ports
[
  {
    "name": "dns",
    "port": 53,
    "protocol": "UDP",
    "targetPort": 53
  },
  {
    "name": "dns-tcp",
    "port": 53,
    "protocol": "TCP",
    "targetPort": 53
  }
]

$ kubectl get pods -n kube-system --selector k8s-app=kube-dns -o json | jq \
.items[0].spec.containers[].ports
[
  {
    "containerPort": 53,
    "name": "dns",
    "protocol": "UDP"
  },
  {
    "containerPort": 53,
    "name": "dns-tcp",
    "protocol": "TCP"
  },
  {
    "containerPort": 9153,
    "name": "metrics",
    "protocol": "TCP"
  }
]

$ kubectl edit svc -n kube-system kube-dns
service/kube-dns edited

$ kubectl get svc -n kube-system kube-dns -o json | jq .spec.ports
[
  {
    "name": "dns",
    "port": 53,
    "protocol": "UDP",
    "targetPort": 53
  },
  {
    "name": "dns-tcp",
    "port": 53,
    "protocol": "TCP",
    "targetPort": 53
  },
  {
    "name": "metrics",
    "port": 9153,
    "protocol": "TCP",
    "targetPort": 9153
  }
]
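
If you prefer not to edit the Service interactively, a JSON patch should achieve the same result. I used kubectl edit myself, so treat the following as an untested equivalent:

$ kubectl patch svc kube-dns -n kube-system --type=json \
  -p '[{"op":"add","path":"/spec/ports/-","value":{"name":"metrics","port":9153,"protocol":"TCP","targetPort":9153}}]'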

Now all scrapes are working, according to the logs:

I0812 09:30:05.000218       1 manager.go:91] Scraping metrics start: 2019-08-12 09:29:00 +0000 UTC, end: 2019-08-12 09:30:00 +0000 UTC
I0812 09:30:05.000282       1 manager.go:96] Scraping sources from provider: internal_stats_provider
I0812 09:30:05.000289       1 manager.go:96] Scraping sources from provider: prometheus_metrics_provider: kube-system-service-kube-dns
I0812 09:30:05.000317       1 manager.go:96] Scraping sources from provider: prometheus_metrics_provider: kube-system-service-kube-state-metrics
I0812 09:30:05.000324       1 manager.go:96] Scraping sources from provider: prometheus_metrics_provider: pod-velero-7d97d7ff65-drl5c
I0812 09:30:05.000329       1 manager.go:96] Scraping sources from provider: kubernetes_summary_provider
I0812 09:30:05.000364       1 summary.go:452] nodeInfo: [nodeName:fd8f9036-189f-447c-bbac-71a9fea519c0 hostname:192.168.192.3 hostID: ip:192.168.192.3]
I0812 09:30:05.000374       1 summary.go:452] nodeInfo: [nodeName:ebbb4c31-375b-4b17-840d-db0586dd948b hostname:192.168.192.4 hostID: ip:192.168.192.4]
I0812 09:30:05.000409       1 summary.go:452] nodeInfo: [nodeName:140ab5aa-0159-4612-b68c-df39dbea2245 hostname:192.168.192.5 hostID: ip:192.168.192.5]
I0812 09:30:05.006776       1 manager.go:120] Querying source: internal_stats_source
I0812 09:30:05.007593       1 manager.go:120] Querying source: prometheus_source: http://172.16.6.2:8085/metrics
I0812 09:30:05.010829       1 manager.go:120] Querying source: kubelet_summary:192.168.192.3:10250
I0812 09:30:05.011518       1 manager.go:120] Querying source: prometheus_source: http://10.100.200.187:8080/metrics
I0812 09:30:05.034885       1 manager.go:120] Querying source: kubelet_summary:192.168.192.4:10250
I0812 09:30:05.037789       1 manager.go:120] Querying source: prometheus_source: http://10.100.200.2:9153/metrics
I0812 09:30:05.053807       1 manager.go:120] Querying source: kubelet_summary:192.168.192.5:10250
I0812 09:30:05.308996       1 manager.go:179] ScrapeMetrics: time: 308.554053ms size: 83
I0812 09:30:05.311586       1 manager.go:92] Pushing data to: Wavefront Sink
I0812 09:30:05.311602       1 manager.go:95] Data push completed: Wavefront Sink
I0812 09:30:05.311611       1 wavefront.go:122] received metric points: 28894
I0812 09:30:05.947893       1 wavefront.go:133] received metric sets: 91

Now when I log on to Wavefront, I am able to see my Kubernetes cluster in the additional dashboards where it did not previously appear. Select Integrations > Kubernetes > Dashboards, choose Kubernetes Metrics as the dashboard to view, and pick the name of the cluster from the drop-down. My cluster (cork8s-cluster-01) is now available; previously it was not in the list of clusters.

The other dashboard where my cluster was not visible was the Kubernetes Metrics by Namespace dashboard. Now I can see my cluster here as well. In this dashboard, select the namespace that you are interested in monitoring.

And that completes the post. I now have my K8s cluster sending all of its metrics back to Wavefront for monitoring. I do want to add that this was how I resolved these issues in my own lab. For production-related issues, I would certainly speak to the Wavefront team and verify that there are no gotchas before implementing these workarounds.