Reviewing PKS logs and status

After a bit of a sabbatical, I am back to looking at PKS (Pivotal Container Service) again. I wanted to look at the new version 1.3, but I had to do a bit of work on my environment to allow me to do this. Primarily, I needed to upgrade my NSX-T environment from version 2.1 to 2.3. I followed this blog post from vmtechie, which provides a useful step-by-step guide. Kudos to our VMware NSX-T team as the upgrade worked without a hitch. My next step was to start work on the PKS deployment. I just did a brand new deployment of PKS 1.3, and I used the following set of blog posts from Keith Lee, which provided invaluable help. Now, because I have a weird lab network setup, I have my PKS components on one VLAN and my vSphere management components on another VLAN. This means I have had to multi-home a number of pieces of infrastructure, such as my vCenter appliance, my NSX Manager and my PCF Ops Manager. This caused me a bit of grief when it came to configuring the certificates (to establish trust between BOSH, PKS and the NSX Manager). Fortunately, like all these things, I learnt a little bit more about PKS troubleshooting during this exercise, and I've decided to share some of these tidbits with you in this post.

On a few occasions, my PKS deployment got as far as this point and then keeled over (this is the verbose log output from the PKS tile, by the way):

+ tags: {}
Task 24
Task 24 | 13:58:12 | Preparing deployment: Preparing deployment (00:00:06)
Task 24 | 13:58:31 | Preparing package compilation: Finding packages to compile (00:00:01)
Task 24 | 13:58:32 | Compiling packages: golang-1-linux/8fb48ae1b653b7d0b49d0cbcea856bb8da8a5700 (00:02:55)
Task 24 | 14:01:27 | Compiling packages: bosh-dns/138f3bd2440ba97f0a7d8912facb5d4a2b320850 (00:00:36)
Task 24 | 14:02:18 | Creating missing vms: pivotal-container-service/506a36a9-c9b9-4c16-b1f0-e50a50512450 (0) (00:02:37)
Task 24 | 14:04:55 | Updating instance pivotal-container-service: pivotal-container-service/506a36a9-c9b9-4c16-b1f0-e50a50512450 (0) (canary) (00:07:28)
                  L Error: 'pivotal-container-service/506a36a9-c9b9-4c16-b1f0-e50a50512450 (0)' is not running after update. Review logs for failed jobs: pks-nsx-t-osb-proxy
Task 24 | 14:12:23 | Error: 'pivotal-container-service/506a36a9-c9b9-4c16-b1f0-e50a50512450 (0)' is not running after update. Review logs for failed jobs: pks-nsx-t-osb-proxy
Task 24 Started  Mon Feb  4 13:58:12 UTC 2019
Task 24 Finished Mon Feb  4 14:12:23 UTC 2019
Task 24 Duration 00:14:11
Task 24 error
Updating deployment:
Expected task '24' to succeed but state is 'error'
Exit code 1
===== 2019-02-04 14:12:23 UTC Finished "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=192.50.0.140 --deployment=pivotal-container-service-c1fe57749018d20ae221 deploy /var/tempest/workspaces/default/deployments/pivotal-container-service-c1fe57749018d20ae221.yml"; Duration: 852s; Exit Status: 1
Exited with 1.

 

This raises the obvious question: how do you review those logs? Well, what you need to do is set up the BOSH CLI, query the deployment and the VM ID of the PKS VM, and then open a shell to that VM. From the shell, you can query the status of the services/jobs and examine the logs. OK, let's break it down into distinct steps.
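In outline, the whole flow looks something like this (the deployment and instance names are placeholders; real output from my lab follows in the numbered steps below):

bosh deployments                                   # find the pivotal-container-service-xxxx deployment name
bosh vms                                           # find the PKS instance and check its process state
bosh -e <bosh-ip> -d <deployment> ssh <instance>   # open a shell on the failing VM
sudo su -                                          # become root inside the VM
monit summary                                      # see which job is not running
cd /var/vcap/sys/log/<failing-job>                 # and examine its logs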

 

1. Set up the BOSH CLI
For those of you already familiar with PKS, you will be aware of the necessity to have a dedicated VM where the various CLIs for BOSH, PKS and Kubernetes reside. I have already described how to set up the BOSH CLI in section 8 of this earlier post on PKS. The useful commands for querying deployment status are "bosh task" and "bosh vms". Once my PKS deployment had failed as shown above, I was able to use them to query the state of the deployment.
cormac@pks-cli:~$ bosh vms
Using environment '192.50.0.140' as client 'ops_manager'
Task 36. Done
Deployment 'pivotal-container-service-21ced5842e2c62f5dd13'
Instance                                                        Process State  AZ     IPs           VM CID                                   VM Type  Active
pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa  failing        CH-AZ  192.50.0.141  vm-edc97e9f-79f7-4709-a2e3-0a64b1feb37c  large    true
1 vms
Succeeded
cormac@pks-cli:~$
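In case you are wondering why "bosh vms" works here without any -e or -d options, the BOSH CLI picks up its target and credentials from environment variables. On my CLI VM I have something like the following exported in my profile; the values are placeholders from my lab, and the client secret comes from the "Bosh Commandline Credentials" entry in Ops Manager:

export BOSH_ENVIRONMENT=192.50.0.140
export BOSH_CLIENT=ops_manager
export BOSH_CLIENT_SECRET=<secret from Ops Manager>
export BOSH_CA_CERT=/home/cormac/opsmanager-root-ca.pem   # hypothetical path to the downloaded Ops Manager root CA cert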

 

Now, to actually do further troubleshooting, I can use the following command to open an SSH session to the VM (-e specifies the environment and -d the deployment; the last argument is the instance):

cormac@pks-cli:~$ bosh -e 192.50.0.140 -d pivotal-container-service-21ced5842e2c62f5dd13 ssh pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa
Using environment '192.50.0.140' as client 'ops_manager'
Using deployment 'pivotal-container-service-21ced5842e2c62f5dd13'
Task 37. Done
Unauthorized use is strictly prohibited. All access and activity
is subject to logging and monitoring.
Welcome to Ubuntu 16.04.5 LTS (GNU/Linux 4.15.0-43-generic x86_64)
* Documentation:  https://help.ubuntu.com
* Management:     https://landscape.canonical.com
* Support:        https://ubuntu.com/advantage
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
Last login: Mon Feb  4 17:12:46 2019 from 192.50.0.6
To run a command as administrator (user "root"), use "sudo <command>".
See “man sudo_root” for details.
pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa:~$

 

Now that we are in the PKS VM shell, we can become superuser and query the state of the processes/jobs using the "monit summary" command. As you can see below, there is an issue with pks-nsx-t-osb-proxy, which was also reported in the output above. I read a few community posts stating that this job has an issue if certificates haven't been set up correctly. Let's see if we can figure out what the problem is.

pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa:~$ sudo su -
pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa:~# monit summary
The Monit daemon 5.2.5 uptime: 22m
Process 'pks-api'                   running
Process 'broker'                    running
Process 'pks-nsx-t-osb-proxy'       not monitored
Process 'mariadb_ctrl'              running
Process 'galera-healthcheck'        running
Process 'gra-log-purger-executable' running
Process 'cluster_health_logger'     running
Process 'uaa'                       running
Process 'telemetry-server'          running
Process 'bosh-dns'                  running
Process 'bosh-dns-resolvconf'       running
Process 'bosh-dns-healthcheck'      running
System 'system_localhost'           running
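As an aside, once you believe the underlying issue has been addressed, you don't necessarily have to wait for a full redeploy just to retry a failed job; monit can restart it in place. Something along these lines should do it (using the job name exactly as reported by monit summary):

monit restart pks-nsx-t-osb-proxy     # ask monit to restart just the failed job
watch monit summary                   # keep an eye on it until it reports 'running'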

 

Let’s examine the appropriate logs next:
pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa:~# cd  /var/vcap/sys/log/pks-nsx-t-osb-proxy/
pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa:/var/vcap/sys/log/pks-nsx-t-osb-proxy# ls -ltr
total 20
-rw-r--r-- 1 root root     0 Feb  4 16:50 pks-nsx-t-osb-proxy.stdout.log
-rw-r--r-- 1 root root    93 Feb  4 17:13 pks_nsx_t_osb_proxy_ctl.stderr.log
-rw-r--r-- 1 root root  4020 Feb  4 17:13 pks_nsx_t_osb_proxy_ctl.stdout.log
-rw-r--r-- 1 root root 10200 Feb  4 17:13 pks-nsx-t-osb-proxy.stderr.log
pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa:/var/vcap/sys/log/pks-nsx-t-osb-proxy# tail pks-nsx-t-osb-proxy.stderr.log
time="2019-02-04T16:50:55Z" level=error msg="Failed to get node properties" pks-networking=networkManager
2019/02/04 16:50:55 Error initializing a NSX-T client: Error getting network manager for cluster Get https://10.27.51.114/api/v1/node: dial tcp 10.27.51.114:443: connect: network is unreachable
time="2019-02-04T16:51:46Z" level=error msg="Failed to get node properties" pks-networking=networkManager
2019/02/04 16:51:46 Error initializing a NSX-T client: Error getting network manager for cluster Get https://10.27.51.114/api/v1/node: dial tcp 10.27.51.114:443: connect: network is unreachable
time="2019-02-04T16:52:36Z" level=error msg="Failed to get node properties" pks-networking=networkManager
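Before changing anything, it is worth verifying basic reachability from the PKS VM itself. Hitting the same NSX Manager API endpoint that the job is trying to reach tells you quickly whether you are dealing with a routing problem or a certificate problem (substitute your own NSX Manager address and admin credentials):

curl -kv https://10.27.51.114/api/v1/node -u 'admin:<nsx-admin-password>'
# -k skips certificate verification, -v shows the connection attempt and the certificate presented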

 

I realised at this point that my PKS VM, deployed on VLAN 50, needed access to the NSX Manager, which was currently only on VLAN 51. This meant I had to multi-home my NSX Manager so that communication between the NSX Manager and PKS could proceed (again, this is due to my funky network layout, and probably something you would not design for in a real-life scenario). Next time through the loop, when I examined the logs, the same failure was reported in the PKS logs, but now there was a different message in the pks-nsx-t-osb-proxy.stderr.log.
2019/02/04 17:29:36 Error initializing a NSX-T client: Error getting network manager for cluster Get https://192.50.0.5/api/v1/node: x509: certificate is valid for 10.27.51.114, not 192.50.0.5
time="2019-02-04T17:30:28Z" level=error msg="Failed to get node properties" pks-networking=networkManager
2019/02/04 17:30:28 Error initializing a NSX-T client: Error getting network manager for cluster Get https://192.50.0.5/api/v1/node: x509: certificate is valid for 10.27.51.114, not 192.50.0.5
time="2019-02-04T17:31:16Z" level=error msg="Failed to get node properties" pks-networking=networkManager

 

Ugh! So I had built my NSX Manager cert for the VLAN 51 interface (IP address), but of course I needed it built for the VLAN 50 interface (a different IP address). I created a new cert, re-added it to the NSX Manager, and re-added it to the PKS tile AND the BOSH tile in the PCF Ops Manager UI. This time everything worked when I redeployed. The whole procedure for creating and registering certs can be found in the PKS documentation here.
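For what it's worth, generating the new certificate boiled down to an openssl request with the correct IP address in the subjectAltName, roughly along these lines (the config file name is my own, and the CN and SAN values obviously need to match whatever address PKS uses to reach your NSX Manager):

# nsx-cert.cnf (hypothetical file name)
[ req ]
default_bits       = 2048
prompt             = no
distinguished_name = dn
req_extensions     = v3_req
[ dn ]
CN = nsx-mgr.mylab.local
[ v3_req ]
subjectAltName = IP:192.50.0.5

# generate a self-signed cert and key using the config above
openssl req -x509 -newkey rsa:2048 -nodes -days 365 -keyout nsx.key -out nsx.crt -config nsx-cert.cnf -extensions v3_req

With the new cert registered on the NSX Manager and in both tiles, the redeploy completed cleanly: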
cormac@pks-cli:~$ bosh task
Using environment '192.50.0.140' as client 'ops_manager'
Task 127
Task 127 | 08:48:46 | Preparing deployment: Preparing deployment (00:00:06)
Task 127 | 08:49:06 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 127 | 08:49:06 | Updating instance pivotal-container-service: pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa (0) (canary) (00:02:12)
Task 127 Started  Tue Feb  5 08:48:46 UTC 2019
Task 127 Finished Tue Feb  5 08:51:18 UTC 2019
Task 127 Duration 00:02:32
Task 127 done
Succeeded
cormac@pks-cli:~$ bosh vms
Using environment '192.50.0.140' as client 'ops_manager'
Task 132. Done
Deployment 'pivotal-container-service-21ced5842e2c62f5dd13'
Instance                                                        Process State  AZ     IPs           VM CID                                   VM Type  Active
pivotal-container-service/765a0aeb-1773-419f-8465-dabfce0d0ffa  running        CH-AZ  192.50.0.141  vm-edc97e9f-79f7-4709-a2e3-0a64b1feb37c  large    true
1 vms
Succeeded
cormac@pks-cli:~$

 

Sweet. Now I’m ready to build some Kubernetes clusters.