TKG v1.3 and the NSX Advanced Load Balancer

In my most recent post, we took a look at how Cluster API is utilized in TKG. Note that this post refers to the Tanzu Kubernetes Grid (TKG) multi-cloud version, sometimes referred to as TKGm. I will use this naming convention for the multi-cloud TKG throughout this post, to differentiate it from the other TKG products in the Tanzu portfolio. In this post, we will take a closer look at a new feature in TKG v1.3, namely support for the NSX ALB – Advanced Load Balancer (formerly known as Avi Vantage) – to provide virtual IP addresses for applications that require a load balancer service. I have already documented the steps on how to integrate the NSX ALB with vSphere with Tanzu. While the setup of the NSX Advanced Load Balancer for TKGm is very similar, the workflow is slightly different. There is a new installer workflow in NSX ALB version 20.1.5 compared to the version 20.1.4 deployed previously. There are also some additional steps in the TKG management cluster installer UI to accommodate the integration with the NSX ALB. We will cover those in this post.

Before we begin, it is important to highlight that TKGm continues to use Kube-VIP to provide the front-end virtual IP address for both the management cluster API server and the workload cluster API server. The NSX ALB, when integrated with TKGm, provides virtual IP addresses for applications that require a load balancer service. Thus, in the configuration file for both the management cluster and the workload cluster, a vSphere IP address endpoint for the respective cluster is specified. These IP addresses are considered static and must be outside of your DHCP IP address range.
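For reference, here are the static API server endpoint settings used later in this post, one per cluster configuration file:

VSPHERE_CONTROL_PLANE_ENDPOINT: 10.35.13.240    # management cluster
VSPHERE_CONTROL_PLANE_ENDPOINT: 10.35.13.151    # workload cluster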

NSX ALB Deployment

The deployment of the NSX ALB is identical for the most part to the deployment steps already outlined in the vSphere with Tanzu blog post. The big difference in version 20.1.5 is that after powering on the appliance and completing the initial System Settings, Email/SMTP and Multi-Tenant configuration, you can launch directly into the Setup Cloud After step for VMware vCenter/vSphere ESX, which appears at the bottom of the Welcome UI, as shown here:

The remaining steps to configure the NSX ALB are identical to those outlined previously, where details about vCenter and the vSphere environment are added, a certificate is created, a network and address range for both NSX ALB Service Engines and Load Balancing VIPs are chosen, an IPAM profile is created, and so on. Since these steps are already documented in that post, we will not repeat them here.

TKG Management Cluster

The deployment of the TKG management cluster has also been covered in detail in the Cluster API blog post. There is one additional piece that is relevant to us, and that is the inclusion of the NSX ALB section. In the TKG management cluster creation UI, this new (optional) NSX ALB section looks as follows, currently only partially populated:

The resulting manifest file for creating a TKG management cluster would then look something like this.  This file is created in the $HOME/.tanzu/tkg/clusterconfigs folder on the host where the TKGm UI installer is launched. Note that there is no LDAP or OIDC configured in this setup. Most of the NSX ALB configuration is at the beginning of the manifest.

AVI_CA_DATA_B64: LS0........
AVI_CLOUD_NAME: Default-Cloud
AVI_CONTROLLER: 10.35.13.40
AVI_DATA_NETWORK: VL3513-PRIV-DPG
AVI_DATA_NETWORK_CIDR: 10.35.13.0/24
AVI_ENABLE: "true"
AVI_LABELS: ""
AVI_PASSWORD: <encoded:Vk13YXJlMTIzIQ==>
AVI_SERVICE_ENGINE_GROUP: Default-Group
AVI_USERNAME: admin
CLUSTER_CIDR: 100.96.1.0/11
CLUSTER_NAME: tkgm-lb
CLUSTER_PLAN: dev
ENABLE_CEIP_PARTICIPATION: "false"
ENABLE_MHC: "true"
IDENTITY_MANAGEMENT_TYPE: none
INFRASTRUCTURE_PROVIDER: vsphere
LDAP_BIND_DN: ""
LDAP_BIND_PASSWORD: ""
LDAP_GROUP_SEARCH_BASE_DN: ""
LDAP_GROUP_SEARCH_FILTER: ""
LDAP_GROUP_SEARCH_GROUP_ATTRIBUTE: ""
LDAP_GROUP_SEARCH_NAME_ATTRIBUTE: cn
LDAP_GROUP_SEARCH_USER_ATTRIBUTE: DN
LDAP_HOST: ""
LDAP_ROOT_CA_DATA_B64: ""
LDAP_USER_SEARCH_BASE_DN: ""
LDAP_USER_SEARCH_FILTER: ""
LDAP_USER_SEARCH_NAME_ATTRIBUTE: ""
LDAP_USER_SEARCH_USERNAME: userPrincipalName
OIDC_IDENTITY_PROVIDER_CLIENT_ID: ""
OIDC_IDENTITY_PROVIDER_CLIENT_SECRET: ""
OIDC_IDENTITY_PROVIDER_GROUPS_CLAIM: ""
OIDC_IDENTITY_PROVIDER_ISSUER_URL: ""
OIDC_IDENTITY_PROVIDER_NAME: ""
OIDC_IDENTITY_PROVIDER_SCOPES: ""
OIDC_IDENTITY_PROVIDER_USERNAME_CLAIM: ""
SERVICE_CIDR: 100.64.1.0/13
TKG_HTTP_PROXY_ENABLED: "false"
VSPHERE_CONTROL_PLANE_DISK_GIB: "40"
VSPHERE_CONTROL_PLANE_ENDPOINT: 10.35.13.240
VSPHERE_CONTROL_PLANE_MEM_MIB: "8192"
VSPHERE_CONTROL_PLANE_NUM_CPUS: "2"
VSPHERE_DATACENTER: /CH-OCTO-DC
VSPHERE_DATASTORE: /CH-OCTO-DC/datastore/vsanDatastore
VSPHERE_FOLDER: /CH-OCTO-DC/vm/TKG
VSPHERE_NETWORK: VL3513-PRIV-DPG
VSPHERE_PASSWORD: <encoded:QWRtaW4hMjM=>
VSPHERE_RESOURCE_POOL: /CH-OCTO-DC/host/CH-Cluster/Resources
VSPHERE_SERVER: 10.35.13.116
VSPHERE_SSH_AUTHORIZED_KEY: ssh-rsa AAAA....== cormac
VSPHERE_TLS_THUMBPRINT: 56:C1:C5:47:FF:00:C0:68:97:FF:A5:14:6B:0E:37:65:3C:CF:48:90
VSPHERE_USERNAME: administrator@vsphere.local
VSPHERE_WORKER_DISK_GIB: "40"
VSPHERE_WORKER_MEM_MIB: "8192"
VSPHERE_WORKER_NUM_CPUS: "2"
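One note on the AVI_CA_DATA_B64 value above: this is simply the NSX ALB controller certificate, base64 encoded. As a rough sketch, assuming the controller certificate created during the NSX ALB setup has been saved locally to a file (I am using the name avi-controller-ca.crt purely for illustration), the value can be generated like this:

$ base64 -w 0 avi-controller-ca.crt
LS0........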

Using the tanzu management-cluster create command, we can now roll out the management cluster with the above configuration file. I have included the -v 6 option so that the output is more verbose. Note that once the installer detects vSphere 7, it offers the option of setting up vSphere with Tanzu rather than TKGm, and you need to respond appropriately to continue with the TKGm management cluster deployment. The deployment goes through the typical Cluster API model of creating a very small kind (Kubernetes in Docker) cluster, adding the Cluster API provider extensions to that kind cluster, and then using those providers to build a TKG management cluster backed by VMs on vSphere. Once the management cluster is up and running on VMs, the context is switched to this cluster and the kind cluster is removed.
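As the installer output below also points out, these prompts can be skipped entirely by adding the following optional setting to the cluster configuration file (shown here in the same format as the other boolean settings in the manifest):

DEPLOY_TKG_ON_VSPHERE7: "true"

Here is the complete output.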

$ tanzu management-cluster create --file ./z2l657j5bm.yaml -v 6
CEIP Opt-in status: false

Validating the pre-requisites...

vSphere 7.0 with Tanzu Detected.

You have connected to a vSphere 7.0 with Tanzu environment that includes an integrated Tanzu Kubernetes Grid Service which
turns a vSphere cluster into a platform for running Kubernetes workloads in dedicated resource pools. Configuring Tanzu
Kubernetes Grid Service is done through the vSphere HTML5 Client.

Tanzu Kubernetes Grid Service is the preferred way to consume Tanzu Kubernetes Grid in vSphere 7.0 environments. Alternatively you may
deploy a non-integrated Tanzu Kubernetes Grid instance on vSphere 7.0.
Note: To skip the prompts and directly deploy a non-integrated Tanzu Kubernetes Grid instance on vSphere 7.0, you can set the 'DEPLOY_TKG_ON_VSPHERE7' configuration variable to 'true'

Do you want to configure vSphere with Tanzu? [y/N]: N
Would you like to deploy a non-integrated Tanzu Kubernetes Grid management cluster on vSphere 7.0? [y/N]: y
Deploying TKG management cluster on vSphere 7.0 ...
Identity Provider not configured. Some authentication features won't work.
no os options provided, selecting based on default os options

Setting up management cluster...
Validating configuration...
Using infrastructure provider vsphere:v0.7.7
Generating cluster configuration...
Setting up bootstrapper...
Fetching configuration for kind node image...
kindConfig:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  imageRepository: projects.registry.vmware.com/tkg
  etcd:
    local:
      imageRepository: projects.registry.vmware.com/tkg
      imageTag: v3.4.13_vmware.7
  dns:
    type: CoreDNS
    imageRepository: projects.registry.vmware.com/tkg
    imageTag: v1.7.0_vmware.8
nodes:
- role: control-plane
  extraMounts:
    - hostPath: /var/run/docker.sock
      containerPath: /var/run/docker.sock
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".registry.configs."cormac-tkgm.corinternal.com".tls]
    insecure_skip_verify = true
Creating kind cluster: tkg-kind-c33lkn8r994jb6dpv2p0
Ensuring node image (cormac-tkgm.corinternal.com/library/kind/node:v1.20.5_vmware.1) ...
Image: cormac-tkgm.corinternal.com/library/kind/node:v1.20.5_vmware.1 present locally
Preparing nodes ...
Writing configuration ...
Starting control-plane ...
Installing CNI ...
Installing StorageClass ...
Waiting 2m0s for control-plane = Ready ...
Ready after 35s
Bootstrapper created. Kubeconfig: /home/cormac/.kube-tkg/tmp/config_SxcKk9Ia
Installing providers on bootstrapper...
Fetching providers
Installing cert-manager Version="v0.16.1"
Waiting for cert-manager to be available...
Installing Provider="cluster-api" Version="v0.3.14" TargetNamespace="capi-system"
Installing Provider="bootstrap-kubeadm" Version="v0.3.14" TargetNamespace="capi-kubeadm-bootstrap-system"
Installing Provider="control-plane-kubeadm" Version="v0.3.14" TargetNamespace="capi-kubeadm-control-plane-system"
Installing Provider="infrastructure-vsphere" Version="v0.7.7" TargetNamespace="capv-system"
installed Component=="cluster-api" Type=="CoreProvider" Version=="v0.3.14"
installed Component=="kubeadm" Type=="BootstrapProvider" Version=="v0.3.14"
installed Component=="kubeadm" Type=="ControlPlaneProvider" Version=="v0.3.14"
installed Component=="vsphere" Type=="InfrastructureProvider" Version=="v0.7.7"
Waiting for provider infrastructure-vsphere
Waiting for provider cluster-api
Waiting for provider control-plane-kubeadm
Waiting for provider bootstrap-kubeadm
Waiting for resource capi-kubeadm-control-plane-controller-manager of type *v1.Deployment to be up and running
pods are not yet running for deployment 'capi-kubeadm-control-plane-controller-manager' in namespace 'capi-kubeadm-control-plane-system', retrying
Waiting for resource capv-controller-manager of type *v1.Deployment to be up and running
pods are not yet running for deployment 'capv-controller-manager' in namespace 'capv-system', retrying
Waiting for resource capi-controller-manager of type *v1.Deployment to be up and running
Waiting for resource capi-kubeadm-bootstrap-controller-manager of type *v1.Deployment to be up and running
pods are not yet running for deployment 'capi-kubeadm-bootstrap-controller-manager' in namespace 'capi-kubeadm-bootstrap-system', retrying
pods are not yet running for deployment 'capi-controller-manager' in namespace 'capi-system', retrying
pods are not yet running for deployment 'capi-kubeadm-control-plane-controller-manager' in namespace 'capi-kubeadm-control-plane-system', retrying
pods are not yet running for deployment 'capv-controller-manager' in namespace 'capv-system', retrying
pods are not yet running for deployment 'capi-controller-manager' in namespace 'capi-system', retrying
pods are not yet running for deployment 'capi-kubeadm-bootstrap-controller-manager' in namespace 'capi-kubeadm-bootstrap-system', retrying
Waiting for resource capi-kubeadm-control-plane-controller-manager of type *v1.Deployment to be up and running
Passed waiting on provider control-plane-kubeadm after 10.14984304s
Waiting for resource capv-controller-manager of type *v1.Deployment to be up and running
pods are not yet running for deployment 'capv-controller-manager' in namespace 'capi-webhook-system', retrying
pods are not yet running for deployment 'capi-controller-manager' in namespace 'capi-system', retrying
Waiting for resource capi-kubeadm-bootstrap-controller-manager of type *v1.Deployment to be up and running
pods are not yet running for deployment 'capi-kubeadm-bootstrap-controller-manager' in namespace 'capi-webhook-system', retrying
Passed waiting on provider infrastructure-vsphere after 15.238228291s
pods are not yet running for deployment 'capi-controller-manager' in namespace 'capi-system', retrying
Passed waiting on provider bootstrap-kubeadm after 15.241068789s
Waiting for resource capi-controller-manager of type *v1.Deployment to be up and running
Passed waiting on provider cluster-api after 20.256005572s
Success waiting on all providers.
Start creating management cluster...
patch cluster object with operation status:
        {
         "metadata": {
                 "annotations": {
                         "TKGOperationInfo" : "{\"Operation\":\"Create\",\"OperationStartTimestamp\":\"2021-06-14 13:34:45.170398929 +0000 UTC\",\"OperationTimeout\":1800}",
                         "TKGOperationLastObservedTimestamp" : "2021-06-14 13:34:45.170398929 +0000 UTC"
                         }
                 }
         }
cluster control plane is still being initialized, retrying
cluster control plane is still being initialized, retrying
cluster control plane is still being initialized, retrying
cluster control plane is still being initialized, retrying
cluster control plane is still being initialized, retrying
Getting secret for cluster
Waiting for resource tkgm-lb-kubeconfig of type *v1.Secret to be up and running
Saving management cluster kubeconfig into /home/cormac/.kube/config
Installing providers on management cluster...
Fetching providers
Installing cert-manager Version="v0.16.1"
Waiting for cert-manager to be available...
Installing Provider="cluster-api" Version="v0.3.14" TargetNamespace="capi-system"
Installing Provider="bootstrap-kubeadm" Version="v0.3.14" TargetNamespace="capi-kubeadm-bootstrap-system"
Installing Provider="control-plane-kubeadm" Version="v0.3.14" TargetNamespace="capi-kubeadm-control-plane-system"
Installing Provider="infrastructure-vsphere" Version="v0.7.7" TargetNamespace="capv-system"
installed Component=="cluster-api" Type=="CoreProvider" Version=="v0.3.14"
installed Component=="kubeadm" Type=="BootstrapProvider" Version=="v0.3.14"
installed Component=="kubeadm" Type=="ControlPlaneProvider" Version=="v0.3.14"
installed Component=="vsphere" Type=="InfrastructureProvider" Version=="v0.7.7"
Waiting for provider infrastructure-vsphere
Waiting for provider control-plane-kubeadm
Waiting for provider cluster-api
Waiting for provider bootstrap-kubeadm
Waiting for resource capi-kubeadm-control-plane-controller-manager of type *v1.Deployment to be up and running
Waiting for resource capv-controller-manager of type *v1.Deployment to be up and running
Waiting for resource capi-kubeadm-bootstrap-controller-manager of type *v1.Deployment to be up and running
pods are not yet running for deployment 'capi-kubeadm-control-plane-controller-manager' in namespace 'capi-kubeadm-control-plane-system', retrying
pods are not yet running for deployment 'capv-controller-manager' in namespace 'capv-system', retrying
pods are not yet running for deployment 'capi-kubeadm-bootstrap-controller-manager' in namespace 'capi-kubeadm-bootstrap-system', retrying
Waiting for resource capi-controller-manager of type *v1.Deployment to be up and running
pods are not yet running for deployment 'capi-controller-manager' in namespace 'capi-system', retrying
pods are not yet running for deployment 'capi-kubeadm-control-plane-controller-manager' in namespace 'capi-kubeadm-control-plane-system', retrying
pods are not yet running for deployment 'capi-kubeadm-bootstrap-controller-manager' in namespace 'capi-kubeadm-bootstrap-system', retrying
pods are not yet running for deployment 'capv-controller-manager' in namespace 'capv-system', retrying
pods are not yet running for deployment 'capi-controller-manager' in namespace 'capi-system', retrying
pods are not yet running for deployment 'capi-kubeadm-control-plane-controller-manager' in namespace 'capi-kubeadm-control-plane-system', retrying
pods are not yet running for deployment 'capi-kubeadm-bootstrap-controller-manager' in namespace 'capi-kubeadm-bootstrap-system', retrying
pods are not yet running for deployment 'capv-controller-manager' in namespace 'capv-system', retrying
pods are not yet running for deployment 'capi-controller-manager' in namespace 'capi-system', retrying
Waiting for resource capi-kubeadm-control-plane-controller-manager of type *v1.Deployment to be up and running
Passed waiting on provider control-plane-kubeadm after 15.104112709s
pods are not yet running for deployment 'capi-kubeadm-bootstrap-controller-manager' in namespace 'capi-kubeadm-bootstrap-system', retrying
pods are not yet running for deployment 'capv-controller-manager' in namespace 'capv-system', retrying
pods are not yet running for deployment 'capi-controller-manager' in namespace 'capi-system', retrying
pods are not yet running for deployment 'capv-controller-manager' in namespace 'capv-system', retrying
pods are not yet running for deployment 'capi-kubeadm-bootstrap-controller-manager' in namespace 'capi-kubeadm-bootstrap-system', retrying
pods are not yet running for deployment 'capi-controller-manager' in namespace 'capi-system', retrying
pods are not yet running for deployment 'capv-controller-manager' in namespace 'capv-system', retrying
Waiting for resource capi-kubeadm-bootstrap-controller-manager of type *v1.Deployment to be up and running
Passed waiting on provider bootstrap-kubeadm after 25.113845112s
pods are not yet running for deployment 'capi-controller-manager' in namespace 'capi-system', retrying
Waiting for resource capv-controller-manager of type *v1.Deployment to be up and running
Passed waiting on provider infrastructure-vsphere after 30.11258043s
Waiting for resource capi-controller-manager of type *v1.Deployment to be up and running
Passed waiting on provider cluster-api after 30.152691662s
Success waiting on all providers.
Waiting for the management cluster to get ready for move...
Waiting for resource tkgm-lb of type *v1alpha3.Cluster to be up and running
Waiting for resources type *v1alpha3.MachineDeploymentList to be up and running
Waiting for resources type *v1alpha3.MachineList to be up and running
Waiting for addons installation...
Waiting for resources type *v1alpha3.ClusterResourceSetList to be up and running
Waiting for resource antrea-controller of type *v1.Deployment to be up and running
Moving all Cluster API objects from bootstrap cluster to management cluster...
Performing move...
Discovering Cluster API objects
Moving Cluster API objects Clusters=1
Creating objects in the target cluster
Deleting objects from the source cluster
Waiting for additional components to be up and running...
Waiting for resource tanzu-addons-controller-manager of type *v1.Deployment to be up and running
Waiting for resource tkr-controller-manager of type *v1.Deployment to be up and running
Waiting for resource kapp-controller of type *v1.Deployment to be up and running
pods are not yet running for deployment 'tanzu-addons-controller-manager' in namespace 'tkg-system', retrying
pods are not yet running for deployment 'tanzu-addons-controller-manager' in namespace 'tkg-system', retrying
Context set for management cluster tkgm-lb as 'tkgm-lb-admin@tkgm-lb'.
Deleting kind cluster: tkg-kind-c33lkn8r994jb6dpv2p0

Management cluster created!

You can now create your first workload cluster by running the following:

 tanzu cluster create [name] -f [file]

Some addons might be getting installed! Check their status by running the following:

 kubectl get apps -A

$ kubectl get apps -A

NAMESPACE   NAME                  DESCRIPTION          SINCE-DEPLOY  AGE
tkg-system  antrea                Reconcile succeeded  2m26s         2m28s
tkg-system  metrics-server        Reconcile succeeded  99s           2m29s
tkg-system  tanzu-addons-manager  Reconcile succeeded  2m28s         5m23s
tkg-system  vsphere-cpi           Reconcile succeeded  102s          2m28s
tkg-system  vsphere-csi           Reconcile succeeded  115s          2m28s

The TKG management cluster is now up and running, and all of the add-ons are present, such as Antrea for networking and the vSphere CSI driver to allow Kubernetes to consume vSphere storage. We can use the tanzu login command to select this management cluster and log into it.

$ tanzu login
? Select a server tkgm-lb ()
✔ successfully logged in to management cluster using the kubeconfig tkgm-lb

Note that the NSX ALB has not been called on to provide a VIP yet; the VIP for the management cluster API server is provided by Kube-VIP.
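If you want to verify this, one quick check (assuming the default TKGm layout, where kube-vip runs as a static pod on the control plane node) is to look for the kube-vip pod in the kube-system namespace of the management cluster:

$ kubectl get pods -n kube-system -o wide | grep kube-vip

We can now proceed with the deployment of a workload cluster.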

TKG Workload Cluster

To deploy a workload cluster, we once again create a configuration file. Here is the one I used in my environment; a sample is provided in $HOME/.tanzu/tkg/clusterconfigs. In this example, the workload cluster is being deployed in an air-gapped environment which has no access to the internet. Thus, all of my TKG images are pulled from a local Harbor registry, referenced by TKG_CUSTOM_IMAGE_REPOSITORY in the configuration file.

$ cat tkgm-workload-lb.yaml
#! -- See https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/1.3/vmware-tanzu-kubernetes-grid-13/GUID-tanzu-k8s-clusters-vsphere.html
#
#! ---------------------------------------------------------------------
#! Basic cluster creation configuration
#! ---------------------------------------------------------------------
#
CLUSTER_NAME: tkgm-workload-lb
CLUSTER_PLAN: prod
CNI: antrea

#! ---------------------------------------------------------------------
#! Node configuration
#! ---------------------------------------------------------------------

CONTROL_PLANE_MACHINE_COUNT: 1
WORKER_MACHINE_COUNT: 2
VSPHERE_CONTROL_PLANE_NUM_CPUS: 2
VSPHERE_CONTROL_PLANE_DISK_GIB: 40
VSPHERE_CONTROL_PLANE_MEM_MIB: 8192
VSPHERE_WORKER_NUM_CPUS: 2
VSPHERE_WORKER_DISK_GIB: 40
VSPHERE_WORKER_MEM_MIB: 4096

#! ---------------------------------------------------------------------
#! vSphere configuration
#! ---------------------------------------------------------------------

VSPHERE_NETWORK: VL3513-PRIV-DPG
VSPHERE_DATACENTER: CH-OCTO-DC
VSPHERE_RESOURCE_POOL: /CH-OCTO-DC/host/CH-Cluster/Resources
VSPHERE_USERNAME: "administrator@vsphere.local"
VSPHERE_PASSWORD: <encoded:QWRtaW4hMjM=>
VSPHERE_SERVER: 10.35.13.116
VSPHERE_DATASTORE: /CH-OCTO-DC/datastore/vsanDatastore
VSPHERE_FOLDER: /CH-OCTO-DC/vm/TKG
VSPHERE_INSECURE: true
VSPHERE_SSH_AUTHORIZED_KEY: adfgsrhsrh
VSPHERE_TLS_THUMBPRINT: 56:C1:C5:....
VSPHERE_CONTROL_PLANE_ENDPOINT: 10.35.13.151

#! ---------------------------------------------------------------------
#! Common configuration
#! ---------------------------------------------------------------------

TKG_CUSTOM_IMAGE_REPOSITORY: "http://cormac-tkgm.corinternal.com/library"
TKG_CUSTOM_IMAGE_REPOSITORY_SKIP_TLS_VERIFY: true

ENABLE_DEFAULT_STORAGE_CLASS: true

CLUSTER_CIDR: 100.96.3.0/11
SERVICE_CIDR: 100.64.3.0/13

We can now apply this configuration file to build the workload cluster. When applied using the tanzu command, it requests the creation of a workload cluster with 1 control plane node and 2 worker nodes. Once deployed, we follow up by retrieving the new workload cluster's kubeconfig, adding its context to our KUBECONFIG, and then switching to that context. I have not added any verbosity to this output. Note that there is a warning about a missing Pinniped configuration, meaning that there are no OIDC or LDAP identity providers configured; you will therefore need to use admin privileges to access the workload cluster. OIDC and LDAP identity management with Pinniped and Dex is another key feature in TKGm v1.3.

$ tanzu cluster create --file ./tkgm-workload-lb.yaml
Validating configuration...
Warning: Pinniped configuration not found. Skipping pinniped configuration in workload cluster. Please refer to the documentation to check if you can configure pinniped on workload cluster manually
Creating workload cluster 'tkgm-workload-lb'...
Waiting for cluster to be initialized...
Waiting for cluster nodes to be available...
Waiting for addons installation...

Workload cluster 'tkgm-workload-lb' created


$ kubectl config get-contexts
CURRENT NAME                  CLUSTER  AUTHINFO         NAMESPACE
*       tkgm-lb-admin@tkgm-lb tkgm-lb  tkgm-lb-admin


$ tanzu cluster list
 NAME              NAMESPACE STATUS  CONTROLPLANE WORKERS KUBERNETES        ROLES  PLAN
 tkgm-workload-lb  default   running 1/1          2/2     v1.20.5+vmware.1  <none> prod


$ tanzu cluster kubeconfig get tkgm-workload-lb
Error: failed to get pinniped-info from management cluster: failed to get pinniped-info from the cluster
...
 

$ tanzu cluster kubeconfig get tkgm-workload-lb --admin
Credentials of cluster 'tkgm-workload-lb' have been saved
You can now access the cluster by running 'kubectl config use-context tkgm-workload-lb-admin@tkgm-workload-lb'


$ kubectl config get-contexts
CURRENT  NAME                                    CLUSTER           AUTHINFO                NAMESPACE
*        tkgm-lb-admin@tkgm-lb                   tkgm-lb           tkgm-lb-admin
         tkgm-workload-lb-admin@tkgm-workload-lb tkgm-workload-lb  tkgm-workload-lb-admin


$ kubectl config use-context tkgm-workload-lb-admin@tkgm-workload-lb
Switched to context "tkgm-workload-lb-admin@tkgm-workload-lb".


$ kubectl get nodes
NAME                                   STATUS ROLES                  AGE     VERSION
tkgm-workload-lb-control-plane-smxtd   Ready  control-plane,master   4m58s   v1.20.5+vmware.1
tkgm-workload-lb-md-0-7986c58d4b-g92tl Ready <none>                  3m29s   v1.20.5+vmware.1
tkgm-workload-lb-md-0-7986c58d4b-zb58j Ready <none>                  3m31s   v1.20.5+vmware.1

The workload cluster is now up and running, but once again it has not used the NSX ALB for any VIPs. The VIP (endpoint) was defined in the configuration file, and Kube-VIP was once again used to configure it as the front-end IP address for the workload cluster’s API server.
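As a quick sanity check, you can confirm that the API server you are now talking to is the Kube-VIP endpoint from the configuration file (10.35.13.151 in my case) rather than an NSX ALB VIP. The following should return that address on the standard API server port:

$ kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
https://10.35.13.151:6443

Let’s now proceed with the deployment of an application which requires a service of type LoadBalancer, and then we will see the NSX ALB providing this VIP.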

Load Balancer application

To test out the NSX ALB, we will use an Nginx Web Server app. Here is the manifest. It is a deployment made up of 3 replicas, and an associated Load Balancer Service.

$ cat nginx-from-harbor.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 3 # tells deployment to run 3 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: cormac-tkgm.corinternal.com/library/nginx:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: nginx-svc
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: nginx

When we apply this manifest, we should observe a Load Balancer IP address/Virtual IP (VIP) from our NSX ALB being allocated to the service.

$ kubectl apply -f nginx-from-harbor.yaml
deployment.apps/nginx-deployment created
service/nginx-svc created


$ kubectl get svc
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)         AGE
kubernetes   ClusterIP      100.64.0.1     <none>         443/TCP         72m
nginx-svc    LoadBalancer   100.70.57.45   10.35.13.192   80:30922/TCP    4m26s


$ ping 10.35.13.192
PING 10.35.13.192 (10.35.13.192) 56(84) bytes of data.
64 bytes from 10.35.13.192: icmp_seq=1 ttl=64 time=0.303 ms
64 bytes from 10.35.13.192: icmp_seq=2 ttl=64 time=0.150 ms
64 bytes from 10.35.13.192: icmp_seq=3 ttl=64 time=0.137 ms
^C
--- 10.35.13.192 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2032ms
rtt min/avg/max/mdev = 0.137/0.196/0.303/0.077 ms

It appears we have successfully received a VIP from the NSX ALB, taken from the VIP range which I configured earlier. You may need to wait a short time for the service to become active before it responds to a ping request. The final test is to see if we can reach the Nginx web server default landing page on HTTP port 80.

$ curl 10.35.13.192
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

Success! Our NSX ALB is now providing VIPs for Load Balancer services in our TKG workload cluster.

Troubleshooting NSX ALB on TKGm

This integration between the NSX ALB and TKG is provided by a new AKO (Avi Kubernetes Operator) extension. The AKO pod can be queried if there are issues with VIPs being provided. Here is a log snippet from the AKO pod on my setup. I have found these logs very useful for identifying misconfiguration issues on the NSX ALB.

$ kubectl logs ako-0 -n avi-system
2021-06-14T13:58:21.274Z INFO api/api.go:52 Setting route for GET /api/status
2021-06-14T13:58:21.274Z INFO ako-main/main.go:61 AKO is running with version: v1.3.1
2021-06-14T13:58:21.274Z INFO api/api.go:110 Starting API server at :8080
2021-06-14T13:58:21.274Z INFO ako-main/main.go:67 We are running inside kubernetes cluster. Won't use kubeconfig files.
2021-06-14T13:58:21.282Z INFO utils/ingress.go:36 networking.k8s.io/v1/IngressClass not found/enabled on cluster: ingressclasses.networking.k8s.io is forbidden: User "system:serviceaccount:avi-system:ako-sa" cannot list resource "ingressclasses" in API group "networking.k8s.io" at the cluster scope
2021-06-14T13:58:21.282Z INFO utils/utils.go:166 Initializing configmap informer in avi-system
2021-06-14T13:58:21.282Z INFO lib/cni.go:96 Skipped initializing dynamic informers

2021-06-14T13:58:21.472Z INFO utils/avi_rest_utils.go:99 Setting the client version to the current controller version 20.1.5
2021-06-14T13:58:21.492Z INFO cache/avi_ctrl_clients.go:72 Setting the client version to 20.1.5
2021-06-14T13:58:21.492Z INFO cache/avi_ctrl_clients.go:72 Setting the client version to 20.1.5
2021-06-14T13:58:21.560Z INFO cache/controller_obj_cache.go:2641 Setting cloud vType: CLOUD_VCENTER
.
.
.
2021-06-14T15:04:10.521Z INFO rest/dequeue_nodes.go:577 key: admin/default-tkgm-workload-lb--default-nginx-svc, msg: creating/updating Pool cache, method: POST
2021-06-14T15:04:10.521Z INFO rest/avi_obj_pool.go:267 key: admin/default-tkgm-workload-lb--default-nginx-svc, msg: Added Pool cache k {admin default-tkgm-workload-lb--default-nginx-svc--80} val {default-tkgm-workload-lb--default-nginx-svc--80 admin pool-ba01367d-ac34-4262-a0b8-0ffc3caea09f 146553777 {<nil>   <nil> <nil> {  } 0   } { } 1623683050279030 false false}

2021-06-14T15:04:10.521Z INFO rest/dequeue_nodes.go:577 key: admin/default-tkgm-workload-lb--default-nginx-svc, msg: creating/updating L4PolicySet cache, method: POST
2021-06-14T15:04:10.521Z INFO rest/avi_obj_l4ps.go:191 Modified the VS cache for l4s object. The cache now is :{"Name":"default-tkgm-workload-lb--default-nginx-svc","Tenant":"admin","Uuid":"","Vip":"","CloudConfigCksum":"","PGKeyCollection":null,"VSVipKeyCollection":[{"Namespace":"admin","Name":"default-tkgm-workload-lb--default-nginx-svc"}],"PoolKeyCollection":[{"Namespace":"admin","Name":"default-tkgm-workload-lb--default-nginx-svc--443"},{"Namespace":"admin","Name":"default-tkgm-workload-lb--default-nginx-svc--80"}],"DSKeyCollection":null,"HTTPKeyCollection":null,"SSLKeyCertCollection":null,"L4PolicyCollection":[{"Namespace":"admin","Name":"default-tkgm-workload-lb--default-nginx-svc"}],"SNIChildCollection":null,"ParentVSRef":{"Namespace":"","Name":""},"PassthroughParentRef":{"Namespace":"","Name":""},"PassthroughChildRef":{"Namespace":"","Name":""},"ServiceMetadataObj":{"namespace_ingress_name":null,"ingress_name":"","namespace":"","hostnames":null,"namespace_svc_name":null,"crd_status":{"type":"","value":"","status":""},"pool_ratio":0,"passthrough_parent_ref":"","passthrough_child_ref":"","gateway":""},"LastModified":"","InvalidData":false,"VSCacheLock":{}}
2021-06-14T15:04:10.521Z INFO rest/avi_obj_l4ps.go:200 Added L4 Policy Set cache k {admin default-tkgm-workload-lb--default-nginx-svc} val {default-tkgm-workload-lb--default-nginx-svc admin l4policyset-72c5338c-c68e-442e-9771-d5905a48aa1f 3045110946 [default-tkgm-workload-lb--default-nginx-svc--443 default-tkgm-workload-lb--default-nginx-svc--80] 1623683050397000 false}

2021-06-14T15:04:10.521Z INFO rest/dequeue_nodes.go:577 key: admin/default-tkgm-workload-lb--default-nginx-svc, msg: creating/updating VirtualService cache, method: POST
2021-06-14T15:04:10.521Z INFO rest/avi_obj_vs.go:374 key:admin/default-tkgm-workload-lb--default-nginx-svc, msg: Service Metadata: {"namespace_ingress_name":null,"ingress_name":"","namespace":"","hostnames":null,"namespace_svc_name":["default/nginx-svc"],"crd_status":{"type":"","value":"","status":""},"pool_ratio":0,"passthrough_parent_ref":"","passthrough_child_ref":"","gateway":""}
2021-06-14T15:04:10.521Z INFO rest/avi_obj_vs.go:396 key: admin/default-tkgm-workload-lb--default-nginx-svc, msg: updated vsvip to the cache: 10.35.13.192
2021-06-14T15:04:10.521Z WARN status/svc_status.go:37 Service hostname not found for service [default/nginx-svc] status update
2021-06-14T15:04:10.536Z INFO status/svc_status.go:83 key: admin/default-tkgm-workload-lb--default-nginx-svc, msg: Successfully updated the status of serviceLB: default/nginx-svc old: [] new [{IP:10.35.13.192 Hostname:}]
.
.
.
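
Beyond the AKO logs, a few other simple checks are worth a mention. These are generic kubectl commands rather than anything AKO-specific, but I find them a useful first look when a VIP is not being allocated:

$ kubectl get pods -n avi-system          # is the ako-0 pod up and running?
$ kubectl describe svc nginx-svc          # any events explaining a pending EXTERNAL-IP?
$ kubectl get svc -A | grep LoadBalancer  # which services are still waiting on a VIP?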

And of course, if all has gone as expected, one should be able to observe the Service Engines being deployed in the vSphere UI, and on logging into the NSX ALB management portal, we should see the Virtual Service appear. Here is a view taken from the Applications > Dashboard:

This is the Applications > Virtual Services view.

A really nice view is to return to the Dashboard and, instead of View VS List, do a View VS Tree and click on the + sign to expand it. This shows the VS, the Pool, the network, and the IP addresses of the TKG nodes that are associated with the service. In this case, since our Nginx application has 3 replicas, we see 3 different nodes displayed.
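To cross-reference what the VS Tree shows against the cluster itself, you can compare the pool member addresses in the NSX ALB UI with the node IP addresses reported by Kubernetes:

$ kubectl get nodes -o wide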

Our TKG workload cluster is now integrated with the NSX Advanced Load Balancer and it is successfully providing Virtual IP addresses (VIPs) for applications in my TKG cluster that request a Load Balancer service.