PKS – Networking Setup Tips and Tricks

In my previous post, I showed how to deploy Pivotal Container Service (PKS) on a simplified flat network. In this post, I will highlight some of the issues you might encounter if you wish to deploy PKS on a more complex network topology. For example, you may have vCenter Server on a vSphere management network alongside the PKS management components (PKS CLI client, Pivotal Ops Manager). You may then want to have another “intermediate network” for the deployment of the BOSH and PKS VMs. And finally, you may have yet another network on which the Kubernetes (K8s) VMs (master, workers) are deployed. These components need to communicate with each other across the different networks, e.g. the BOSH agent on the K8s master and worker VMs needs to be able to reach the vSphere infrastructure. What I want to highlight in this post are some of the issues and error messages that you might encounter when rolling out PKS on such a configuration, and what you can do to fix them. Think of this as lessons I learned while trying to do something similar.

A picture is worth a thousand words, so a final PKS deployment may look something like the layout shown here:

Let’s now look at what happens when certain components in this deployment cannot communicate/route to other components.

ISSUE #1: This is the error I observed when trying to deploy the PKS VM via Pivotal Ops Manager on a network which could not route to my vSphere network. Note that this is the Pivotal Container Service PKS VM (purple above) and not the PKS client with the CLI tools (orange above).

===== 2018-04-19 10:45:02 UTC Finished "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=192.168.191.10 update-config runtime --name=pivotal-container-service-9b9223d27659ed342925-enable-pks-helpers /tmp/pivotal-container-service-9b9223d27659ed342925-enable-pks-helpers.yml20180419-1433-1sp8xkh"; Duration: 0s; Exit Status: 0
===== 2018-04-19 10:45:02 UTC Running "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=192.168.191.10 upload-stemcell /var/tempest/stemcells/bosh-stemcell-3468.28-vsphere-esxi-ubuntu-trusty-go_agent.tgz"
Using environment '192.168.191.10' as client 'ops_manager'
0.00%    0.54% 11.16 MB/s 36s
Task 5
Task 5 | 10:45:35 | Update stemcell: Extracting stemcell archive (00:00:04)
Task 5 | 10:45:39 | Update stemcell: Verifying stemcell manifest (00:00:00)
Task 5 | 10:52:12 | Error: Unknown CPI error 'Unknown' with message 'Please make sure the CPI has proper network access to vSphere. (HTTPClient::ConnectTimeoutError: execution expired)' in 'info' CPI method
Task 5 Started  Thu Apr 19 10:45:35 UTC 2018
Task 5 Finished Thu Apr 19 10:52:12 UTC 2018
Task 5 Duration 00:06:37
Task 5 error
Uploading stemcell file:
Expected task '5' to succeed but state is 'error'
Exit code 1
RESOLUTION #1: The clue is in the message – “proper network access to vSphere”. CPI is short for Cloud Provider Interface, and is basically how BOSH/PKS communicates with the underlying infrastructure, in this case vSphere. To avoid this issue, you need to make sure that the BOSH and PKS VMs can communicate with your vCenter Server/vSphere management network.
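A quick way to verify this before retrying is to check that the VM running Ops Manager/BOSH can actually reach vCenter on port 443. A minimal sketch, assuming vcsa-06.rainpole.com is your vCenter Server (a hypothetical name – substitute your own FQDN or IP):

$ ping -c 3 vcsa-06.rainpole.com
$ nc -zv -w 5 vcsa-06.rainpole.com 443

If the port check times out, revisit the routing/firewall configuration between the “intermediate network” and the vSphere management network before re-applying changes in Ops Manager.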

 

ISSUE #2: This next issue was to do with not being able to resolve fully qualified domain names. If DNS has not been configured correctly when you are setting up the network section of the manifests in Pivotal Ops Manager, then PKS will not be able to resolve ESXi hostnames in your vSphere environment. I’m guessing that the upload-stemcell command which errors here is trying to upload the customized operating system image (the stemcell) for the PKS VM to vSphere, but it is unable to resolve the FQDN of the ESXi host it is uploading to.

===== 2018-04-19 11:23:06 UTC Finished "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=192.168.191.10 upload-stemcell /var/tempest/stemcells/bosh-stemcell-3468.28-vsphere-esxi-ubuntu-trusty-go_agent.tgz"; Duration: 198s; Exit Status: 1
===== 2018-04-19 11:23:06 UTC Running "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=192.168.191.10 upload-stemcell /var/tempest/stemcells/bosh-stemcell-3468.28-vsphere-esxi-ubuntu-trusty-go_agent.tgz"
Using environment '192.168.191.10' as client 'ops_manager'
0.00%    0.51% 10.62 MB/s 38s
Task 14
Task 14 | 11:23:39 | Update stemcell: Extracting stemcell archive (00:00:04)
Task 14 | 11:23:43 | Update stemcell: Verifying stemcell manifest (00:00:00)
Task 14 | 11:23:45 | Update stemcell: Checking if this stemcell already exists (00:00:00)
Task 14 | 11:23:45 | Update stemcell: Uploading stemcell bosh-vsphere-esxi-ubuntu-trusty-go_agent/3468.28 to the cloud (00:00:28)
L Error: Unknown CPI error 'Unknown' with message 'getaddrinfo: Name or service not known (esxi-dell-g.rainpole.com:443)' in 'create_stemcell' CPI method
Task 14 | 11:24:13 | Error: Unknown CPI error 'Unknown' with message 'getaddrinfo: Name or service not known (esxi-dell-g.rainpole.com:443)' in 'create_stemcell' CPI method
Task 14 Started  Thu Apr 19 11:23:39 UTC 2018
Task 14 Finished Thu Apr 19 11:24:13 UTC 2018
Task 14 Duration 00:00:34
Task 14 error
Uploading stemcell file:
Expected task '14' to succeed but state is 'error'
Exit code 1

RESOLUTION #2: Ensure that the DNS server entries in the network sections of the manifests in Ops Manager are correct so that both BOSH and PKS can resolve vCenter and ESXi host FQDNs.
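One quick sanity check is to try resolving the ESXi host FQDN from the error above, using the DNS server you intend to put in the network section of the manifest. For example, from the Ops Manager VM (the <your-dns-server-ip> value is a placeholder for your own DNS server):

$ nslookup esxi-dell-g.rainpole.com
$ nslookup esxi-dell-g.rainpole.com <your-dns-server-ip>

If the lookup fails against the DNS server you have configured, fix the DNS entries in the manifest (or the records themselves) before re-applying changes.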

 

ISSUE #3: The K8s master and worker VMs deploy but never enter a running state – they are left in a state of ‘unresponsive agent’. I used two very useful commands here on the PKS Client to troubleshoot. This PKS Client (orange above) is not the PKS VM (purple above), but the VM where I have my CLI tools deployed (see previous post for more info). One command is bosh vms and the other is bosh task. The first, bosh vms, shows me the current state of deployed VMs, including the K8s VMs, while the bosh task command tracks the K8s cluster deployment tasks. As you can see, the deployment gives up after 10 minutes/600 seconds.

root@pks-cli:~# bosh vms
Using environment '10.27.51.181' as client 'ops_manager'
Task 53
Task 54. Done
Task 53 done

Deployment 'pivotal-container-service-e7febad16f1bf59db116'

Instance                                                        Process State  AZ     IPs           VM CID                                   VM Type  Active
pivotal-container-service/d4a0fd19-e9ce-47a8-a7df-afa100a612fa  running        CH-AZ  10.27.51.182  vm-68aadcae-ba47-41e8-843a-fb3764670861  micro    false

1 vms

Deployment 'service-instance_2214fcfa-c02f-498f-b37b-3a1b9cf89b27'

Instance                                     Process State       AZ     IPs           VM CID                                   VM Type  Active
master/4ed5f285-5c89-4740-b8bf-32682137cab6  unresponsive agent  CH-AZ  192.50.0.140  vm-75109d0d-5581-4cf9-9dcc-d873e9602b9b  -        false
worker/9d9ad944-dac1-4d3b-9838-1f7c61ffb5b1  unresponsive agent  CH-AZ  192.50.0.141  vm-5e882aff-f709-4b7b-ab47-8d6be80cb7dd  -        false
worker/dd673f29-9bc8-4921-b231-ea35f2cc66b1  unresponsive agent  CH-AZ  192.50.0.142  vm-e0b6dca3-cd92-4e0a-8429-9a2fe2a2dc56  -        false
worker/e36af3d7-e0cd-4c23-88e6-adde3f554300  unresponsive agent  CH-AZ  192.50.0.143  vm-cfd0c81e-9811-49cb-9c87-e23063f83a6b  -        false

4 vms

Succeeded

root@pks-cli:~#

 

root@pks-cli:~# bosh task
Using environment '10.27.51.181' as client 'ops_manager'
Task 48
Task 48 | 15:52:13 | Preparing deployment: Preparing deployment (00:00:05)
Task 48 | 15:52:30 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 48 | 15:52:30 | Creating missing vms: master/4ed5f285-5c89-4740-b8bf-32682137cab6 (0)
Task 48 | 15:52:30 | Creating missing vms: worker/e36af3d7-e0cd-4c23-88e6-adde3f554300 (1)
Task 48 | 15:52:30 | Creating missing vms: worker/9d9ad944-dac1-4d3b-9838-1f7c61ffb5b1 (0)
Task 48 | 15:52:30 | Creating missing vms: worker/dd673f29-9bc8-4921-b231-ea35f2cc66b1 (2)
Task 48 | 16:02:53 | Creating missing vms: worker/9d9ad944-dac1-4d3b-9838-1f7c61ffb5b1 (0) (00:10:23)
L Error: Timed out pinging to 8876d9df-290f-41b9-8455-1c8efe5fc05d after 600 seconds
Task 48 | 16:02:58 | Creating missing vms: worker/dd673f29-9bc8-4921-b231-ea35f2cc66b1 (2) (00:10:28)
L Error: Timed out pinging to 893ccb7a-11d8-4055-b486-f435f922954c after 600 seconds
Task 48 | 16:02:58 | Creating missing vms: master/4ed5f285-5c89-4740-b8bf-32682137cab6 (0) (00:10:28)
L Error: Timed out pinging to 4741eb79-ca75-4352-ba8e-d70474c7beb8 after 600 seconds
Task 48 | 16:03:00 | Creating missing vms: worker/e36af3d7-e0cd-4c23-88e6-adde3f554300 (1) (00:10:30)
L Error: Timed out pinging to 0315851b-fdd3-48b5-9415-75d2bf52c945 after 600 seconds
Task 48 | 16:03:00 | Error: Timed out pinging to 8876d9df-290f-41b9-8455-1c8efe5fc05d after 600 seconds
Task 48 Started  Fri Apr 20 15:52:13 UTC 2018
Task 48 Finished Fri Apr 20 16:03:00 UTC 2018
Task 48 Duration 00:10:47
Task 48 error
Capturing task '48' output:
Expected task '48' to succeed but state is 'error'
Exit code 1
root@pks-cli:~#

RESOLUTION #3: The BOSH agent in each Kubernetes VM needs to be able to communicate back to the BOSH VM. This means there must be a route between the Kubernetes VMs deployed on the “Service Network” (the network that is configured in the BOSH manifest and consumed in the PKS manifest in Ops Manager) and the “intermediate network” on which the BOSH and PKS VMs are deployed. If there is no route between these networks, then this is what you will observe.
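A simple way to confirm this is to test reachability from a machine on the “Service Network” segment (or from the console of one of the stuck K8s VMs) back to the BOSH Director. A minimal sketch, assuming 10.27.51.181 is the BOSH Director from the output above and that the agent uses the default NATS port of 4222 (an assumption – check your own deployment):

$ ping -c 3 10.27.51.181
$ nc -zv -w 5 10.27.51.181 4222

If neither works, add the missing route (or adjust the firewall) between the service network and the intermediate network, then redeploy the cluster.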

 

ISSUE #4: In this final issue, the K8s cluster does not deploy successfully. The master and worker VMs are running, but the first K8s worker VM never restarts after being stopped for the canary step. The canary step is where BOSH applies any necessary configuration/components/software to the first (canary) instance of a job, and only if that update is successful does it roll the change out to the remaining master or worker nodes. In this example, we are looking at the task after the failure, again using bosh task. If you give the task number to bosh task, it will list the task steps, as shown below:

root@pks-cli:~# bosh task 31
Using environment '192.168.191.10' as client 'ops_manager'
Task 31
Task 31 | 12:00:21 | Preparing deployment: Preparing deployment (00:00:06)
Task 31 | 12:00:40 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 31 | 12:00:40 | Creating missing vms: master/3544e363-4a12-488b-a2ea-8fb76a480575 (0)
Task 31 | 12:00:40 | Creating missing vms: worker/dff1daa3-9bf0-4e6a-90a3-4dde6286d972 (0)
Task 31 | 12:00:40 | Creating missing vms: worker/363b9529-d7f7-4d64-a389-84a9a13fcc91 (2)
Task 31 | 12:00:40 | Creating missing vms: worker/8908ebd2-d28f-4a9c-b184-c5379fa35824 (1) (00:01:10)
Task 31 | 12:02:01 | Creating missing vms: worker/dff1daa3-9bf0-4e6a-90a3-4dde6286d972 (0) (00:01:21)
Task 31 | 12:02:03 | Creating missing vms: worker/363b9529-d7f7-4d64-a389-84a9a13fcc91 (2) (00:01:23)
Task 31 | 12:02:04 | Creating missing vms: master/3544e363-4a12-488b-a2ea-8fb76a480575 (0) (00:01:24)
Task 31 | 12:02:11 | Updating instance master: master/3544e363-4a12-488b-a2ea-8fb76a480575 (0) (canary) (00:02:02)
Task 31 | 12:04:13 | Updating instance worker: worker/dff1daa3-9bf0-4e6a-90a3-4dde6286d972 (0) (canary) (00:02:29)
L Error: Action Failed get_task: Task 7af7fdd9-fa53-4dc7-5b2a-6c9de2e7df3c result: 1 of 4 pre-start scripts failed. Failed Jobs: kubelet. Successful Jobs: bosh-dns-enable, bosh-dns, syslog_forwarder.
Task 31 | 12:06:42 | Error: Action Failed get_task: Task 7af7fdd9-fa53-4dc7-5b2a-6c9de2e7df3c result: 1 of 4 pre-start scripts failed. Failed Jobs: kubelet. Successful Jobs: bosh-dns-enable, bosh-dns, syslog_forwarder.
Task 31 Started  Thu Apr 19 12:00:21 UTC 2018
Task 31 Finished Thu Apr 19 12:06:42 UTC 2018
Task 31 Duration 00:06:21
Task 31 error
Capturing task '31' output:
Expected task '31' to succeed but state is 'error'
Exit code 1
root@pks-cli:~#

In this case, because the K8s VMs are running, we can actually log onto the K8s VM and see if we can figure out why it failed by looking at the logs. There are 3 steps to this. First we use a new bosh command, bosh deployments.

Step 4.1 – Get list of deployments via BOSH CLI and locate the service instance

root@pks-cli:~# bosh deployments
Using environment '192.168.191.10' as client 'ops_manager'

Name                                                   Release(s)                          Stemcell(s)                                       Team(s)                                         Cloud Config
pivotal-container-service-9b9223d27659ed342925         bosh-dns/1.3.0                      bosh-vsphere-esxi-ubuntu-trusty-go_agent/3468.28  -                                               latest
                                                       cf-mysql/36.10.0
                                                       docker/30.1.4
                                                       kubo/0.13.0
                                                       kubo-etcd/8
                                                       kubo-service-adapter/1.0.0-build.3
                                                       on-demand-service-broker/0.19.0
                                                       pks-api/1.0.0-build.3
                                                       pks-helpers/19.0.0
                                                       pks-nsx-t/0.1.6
                                                       syslog-migration/10
                                                       uaa/54
service-instance_20474001-494e-43b1-aca4-ab8f788078b6  bosh-dns/1.3.0                      bosh-vsphere-esxi-ubuntu-trusty-go_agent/3468.28  pivotal-container-service-9b9223d27659ed342925  latest
                                                       docker/30.1.4
                                                       kubo/0.13.0
                                                       kubo-etcd/8
                                                       pks-helpers/19.0.0
                                                       pks-nsx-t/0.1.6
                                                       syslog-migration/10

2 deployments

Succeeded

 

Step 4.2 – Open an SSH session to the first worker on your K8s cluster, worker/0

Once the service instance is located, we can specify that deployment in the bosh command, and request SSH access to one of the VMs in the K8s cluster, in this case the first worker which is identified as worker/0.

root@pks-cli:~# bosh -d service-instance_20474001-494e-43b1-aca4-ab8f788078b6 ssh worker/0
Using environment '192.168.191.10' as client 'ops_manager'
Using deployment 'service-instance_20474001-494e-43b1-aca4-ab8f788078b6'
Task 130. Done
Unauthorized use is strictly prohibited. All access and activity
is subject to logging and monitoring.
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 4.4.0-116-generic x86_64)

* Documentation:  https://help.ubuntu.com/

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

Last login: Thu Apr 19 15:03:05 2018 from 192.168.192.131
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:~$

 

Step 4.3 – Examine the log files of the worker

The log files we are interested in are /var/vcap/sys/log/kubelet/*.log, as it was the kubelet component which failed during the previous canary step. You will need superuser privileges to view these files, so simply sudo su - to get them. I’ve truncated the log file here, fyi.

worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:/var/vcap/sys/log$ sudo su -
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:~# pwd
/root
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:~# cd /var/vcap/sys/log/kubelet/
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:/var/vcap/sys/log/kubelet# ls -ltr
total 8
-rw-r----- 1 root root   21 Apr 19 12:24 pre-start.stdout.log
-rw-r----- 1 root root 2716 Apr 19 12:25 pre-start.stderr.log
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:/var/vcap/sys/log/kubelet# cat pre-start.stdout.log
rpcbind stop/waiting
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:/var/vcap/sys/log/kubelet# cat pre-start.stderr.log
+ CONF_DIR=/var/vcap/jobs/kubelet/config
+ PKG_DIR=/var/vcap/packages/kubernetes
+ source /var/vcap/packages/kubo-common/utils.sh
+ main
+ detect_cloud_config
<-- snip -->
+ export GOVC_DATACENTER=CH-Datacenter
+ GOVC_DATACENTER=CH-Datacenter
++ cat /sys/class/dmi/id/product_serial
++ sed -e 's/^VMware-//' -e 's/-/ /'
++ awk '{ print tolower($1$2$3$4 "-" $5$6 "-" $7$8 "-" $9$10 "-" $11$12$13$14$15$16) }'
+ local vm_uuid=423c6dcf-d47b-53a3-5a1e-2251d6bdc4b7
+ /var/vcap/packages/govc/bin/govc vm.change -e disk.enableUUID=1 -vm.uuid=423c6dcf-d47b-53a3-5a1e-2251d6bdc4b7
/var/vcap/packages/govc/bin/govc: Post https://10.27.51.106:443/sdk: dial tcp 10.27.51.106:443: i/o timeout
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:/var/vcap/sys/log/kubelet#

RESOLUTION #4: In this example, we see the K8s worker node getting an i/o timeout while trying to communicate with my vCenter Server (that is my VC IP, which I added to the PKS manifest in Pivotal Operations Manager in the Kubernetes Cloud Provider section). This access is required by the K8s VMs to create/manage/delete persistent volumes as VMDKs for the application containers that will run on K8s. In this case, the K8s cluster was deployed on a network segment that allowed it to communicate with the BOSH/PKS VMs, but not with the vCenter Server/vSphere environment.
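Since the worker VMs are up in this scenario, you can check the connectivity by hand from the worker itself (SSH to it as in step 4.2 above). The govc binary that the pre-start script uses is already on the worker, so something along these lines will confirm whether vCenter is reachable – a rough sketch, where 10.27.51.106 is the vCenter IP from the error above and the GOVC_* values are placeholders you would fill in from your own environment:

$ nc -zv -w 5 10.27.51.106 443
$ export GOVC_URL=https://10.27.51.106 GOVC_INSECURE=1
$ export GOVC_USERNAME=administrator@vsphere.local GOVC_PASSWORD='<your-password>'
$ /var/vcap/packages/govc/bin/govc about

If the nc/govc calls time out here too, the problem is network reachability from the K8s segment to vCenter rather than anything PKS-specific.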

[Update] This can also happen if there is no route between the K8s cluster network segment and the vCenter Server. For example, you may have a multi-homed vCenter Server, but the default route could be on the other interface (not the one which has connectivity to the K8s cluster network segment). In that case, it might simply be a matter of adding a route on vCenter for that network. Of course, this could have ramifications for other things, so proceed with caution.
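Purely as an illustration (the subnet, gateway and interface name below are hypothetical), a temporary static route on a Linux-based vCenter Server Appliance might look like the following; making it persistent, and whether it is appropriate at all, is very much environment-specific:

# send traffic for the K8s cluster segment out the second interface
ip route add 192.50.0.0/24 via 192.168.192.1 dev eth1

Check with your vSphere admin before changing routing on vCenter.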

 

Other useful things to know – how to log into BOSH and PKS VMs

We have seen how we can access our K8s VMs if we need to troubleshoot, but what about the BOSH and PKS VMs? This is quite straightforward. Log in to Pivotal Operations Manager, click on the tile of the VM that you wish to log in to, select the Credentials tab, and from there you can retrieve a login for a shell on each of the VMs. Log in as user vcap, supply the password retrieved from Ops Manager, and then sudo if you need superuser privileges.

Here is where to get them for BOSH Director:

Here is where to get them for PKS/Pivotal Container Service:

2 Replies to “PKS – Networking Setup Tips and Tricks”

  1. thanks for the article,
    I ran into issue 3 that pinging timed out during k8s cluster creation;
    however the k8s node network is routable to bosh/pks/ops manager; manual ping to k8s nodes during creation is successful;

    any tips?

    1. Are you able to log onto the K8s master or workers? If so, SSH to them, and try to ping from within the master and workers. Could be gateway settings, or some other misconfig perhaps.
