TKG v1.4 & NSX ALB – Ingress Health Monitor Anomaly

Cormac

3 years ago

As I continue to look at TKG version 1.4, I wanted to start using VMware NSX Advanced Load Balancer (NSX ALB) together with the Project Contour (Envoy Ingress) package. Project Contour is a control plane for the Envoy proxy, which ships in the same package, and it can also change the Ingress configuration dynamically. It is included as an add-on package to TKG v1.4. To use it, I deployed a TKG management cluster and a TKG workload cluster using an NSX ALB (v20.1.5) for the Load Balancing Service. I then proceeded to deploy the Contour package. While the deployment was successful from a Kubernetes perspective, and the Envoy service was allocated an IP address from the range of VIPs on the NSX ALB, the Health Monitor (System-TCP) associated with the Ingress was failing. Here is the Load Balancer IP successfully allocated to the envoy service:

$ kubectl get svc envoy -n tanzu-system-ingress
NAME    TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
envoy   LoadBalancer   100.71.211.12   xx.xx.13.198   80:31742/TCP,443:31508/TCP   19m

The first two octets of the EXTERNAL-IP have been intentionally obfuscated.

However, when you log in to the NSX ALB portal, you will see the pools for the Ingress service reported as down in the Applications > Pools view. The Virtual Service and VIPs are also reported as down.

Looking at Operations > Events in the NSX ALB portal, you will observe a “Connection Refused” message when the Health Monitor tries to check the Envoy Ingress service. This may also be accompanied by another event which has a failure code of “Other Error”.
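If you want to reproduce the check by hand, a plain TCP connect is a rough stand-in for what a TCP health monitor does. The following is a minimal sketch, not the NSX ALB implementation: `tcp_probe` is a hypothetical helper name, and it assumes the pool members are reachable from wherever you run it, on whatever address and port the pool lists for them.

```shell
#!/usr/bin/env bash
# tcp_probe HOST PORT
# Hypothetical helper: attempts a plain TCP connect using bash's /dev/tcp
# and prints UP if the connection is accepted, DOWN if it is refused,
# unreachable, or does not complete within 3 seconds.
tcp_probe() {
  local host="$1" port="$2"
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "UP"
  else
    echo "DOWN"
  fi
}
```

For example, `tcp_probe <pool-member-ip> 31742` would test the NodePort shown in the service output above, assuming the pool member is a cluster node listening on that NodePort. A result of DOWN is consistent with the “Connection Refused” event, although it cannot distinguish a refusal from a timeout.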

While it is not fully clear to me why the Health Monitor is not working with System-TCP for the Ingress, it does not seem to prevent a VIP from being allocated to the Ingress service, as we have seen, so Ingress functionality is unaffected. However, to allow the Health Monitor to succeed, I simply changed the Health Monitor in the NSX ALB from System-TCP to System-Ping. If you wish to do the same, navigate to Applications > Pools, and then edit the Pools associated with the Ingress service. There will be one for port 80 and another for port 443. In the settings section, on the lower left-hand side, you will see the Health Monitors configuration, which will be set to System-TCP.

From the drop-down menu, change this to System-Ping. Save the changes.

You should immediately notice the health of the Ingress service begin to improve.

While I am not yet clear as to why this issue occurs, as soon as I have the root cause, I will update this post. However, this will at least provide a healthier services view in the NSX ALB UI, should you happen to be using an Ingress with the NSX ALB.

[Update] So after some further testing, it seems that once an Ingress consumer (in my case Prometheus) gets deployed, the Pool comes online, along with the other components such as Virtual Service and VIPs. So in fact, you don’t even need to change the Health Monitor to System-Ping from System-TCP after all.

Here is a view of the Project Contour Virtual Service once an HTTPProxy resource required for Prometheus was created. This is an advanced resource type provided by Contour which provides additional benefits over an Ingress.
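For reference, here is a minimal sketch of what creating such an HTTPProxy could look like. It is not the manifest that the Prometheus package generates. `httpproxy_manifest` is a hypothetical helper, and every name, namespace, FQDN and port passed to it below is a placeholder that you would need to replace with values from your own environment.

```shell
#!/usr/bin/env bash
# httpproxy_manifest NAME NAMESPACE FQDN SERVICE PORT
# Hypothetical helper: prints a minimal Contour HTTPProxy manifest that
# routes every path (prefix /) on FQDN to SERVICE:PORT in NAMESPACE.
httpproxy_manifest() {
  local name="$1" namespace="$2" fqdn="$3" service="$4" port="$5"
  cat <<EOF
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  virtualhost:
    fqdn: ${fqdn}
  routes:
    - conditions:
        - prefix: /
      services:
        - name: ${service}
          port: ${port}
EOF
}

# Example with placeholder values (assumptions, not taken from this post):
#   httpproxy_manifest demo-proxy demo-ns demo.example.com demo-svc 80 \
#     | kubectl apply -f -
```

After applying a manifest like this, `kubectl get httpproxy -A` should list the new resource, and you can then check whether the pools in the NSX ALB UI have come online.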