
PKS and NSX-T: Error: Timed out pinging after 600 seconds

I’m still playing with PKS 1.3 and NSX-T 2.3.1 in my lab. One issue that I kept encountering was that, when deploying my Kubernetes cluster, my master and worker nodes kept failing with a “timed out” error while trying to ping. The bosh task command showed the errors:

cormac@pks-cli:~$ bosh task
Using environment '192.50.0.140' as client 'ops_manager'
Task 845
Task 845 | 16:56:36 | Preparing deployment: Preparing deployment
Task 845 | 16:56:37 | Warning: DNS address not available for the link provider instance: pivotal-container-service/0c23ed00-d40a-4bfe-abee-1c
Task 845 | 16:56:37 | Warning: DNS address not available for the link provider instance: pivotal-container-service/0c23ed00-d40a-4bfe-abee-1c
Task 845 | 16:56:37 | Warning: DNS address not available for the link provider instance: pivotal-container-service/0c23ed00-d40a-4bfe-abee-1c
Task 845 | 16:56:49 | Preparing deployment: Preparing deployment (00:00:13)
Task 845 | 16:57:24 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 845 | 16:57:24 | Creating missing vms: master/f46b8a5f-f864-4217-8b54-753f535e69d8 (0)
Task 845 | 16:57:24 | Creating missing vms: worker/9b58c1ac-8fcc-41db-9e52-8e8deb4e944b (0)
Task 845 | 16:57:24 | Creating missing vms: worker/e61b4262-5bcc-4567-bd1c-fea7058a743f (2)
Task 845 | 16:57:24 | Creating missing vms: worker/8982a101-05ea-416d-ba9c-b6e6a98059cf (1)
Task 845 | 17:08:42 | Creating missing vms: worker/9b58c1ac-8fcc-41db-9e52-8e8deb4e944b (0) (00:11:18)
L Error: Timed out pinging to 320d032a-60e0-4c8c-bfa5-bd94e1d0ae13 after 600 seconds
Task 845 | 17:08:42 | Creating missing vms: worker/8982a101-05ea-416d-ba9c-b6e6a98059cf (1) (00:11:18)
L Error: Timed out pinging to dbc1d5cf-3e2a-4618-86d2-280172fb3ffd after 600 seconds
Task 845 | 17:08:43 | Creating missing vms: worker/e61b4262-5bcc-4567-bd1c-fea7058a743f (2) (00:11:19)
L Error: Timed out pinging to 54342511-1768-48b9-8009-27979ccf2542 after 600 seconds
Task 845 | 17:08:51 | Creating missing vms: master/f46b8a5f-f864-4217-8b54-753f535e69d8 (0) (00:11:27)
L Error: Timed out pinging to 2f86ef1a-9636-46ed-993c-d2f11811eae0 after 600 seconds
Task 845 | 17:08:51 | Error: Timed out pinging to 320d032a-60e0-4c8c-bfa5-bd94e1d0ae13 after 600 seconds
Task 845 Started Thu Feb 7 16:56:36 UTC 2019
Task 845 Finished Thu Feb 7 17:08:51 UTC 2019
Task 845 Duration 00:12:15
Task 845 error
Capturing task '845' output:
Expected task '845' to succeed but state is 'error'
Exit code 1
cormac@pks-cli:~$
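If you need more detail than the summary above, the bosh CLI can dump the full director debug log for the failed task (845 in my case). It is worth knowing that the “ping” here is the BOSH director trying to contact the BOSH agent on the newly created VM, not an ICMP ping, so the debug log should show the director repeatedly failing to reach the agent:

cormac@pks-cli:~$ bosh task 845 --debug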

Another symptom was that the VMs that should have made up my Kubernetes cluster deployment never appeared in the bosh vms output:

$ bosh vms
Using environment '192.50.0.140' as client 'ops_manager'
Deployment 'harbor-container-registry-f47eb4b10aee9c980be9'
Instance                                         Process State  AZ     IPs           VM CID                                   VM Type     Active
harbor-app/2d2a3972-1e5f-47f6-847e-87719999a3f0  running        CH-AZ  192.50.0.142  vm-2529abc5-95a0-4ba6-b5d9-e35ab87d1dd8  large.disk  true
1 vms
 
Deployment 'pivotal-container-service-766248e89e8a6c67ebd9'
Instance                                                        Process State  AZ     IPs           VM CID                                   VM Type     Active
pivotal-container-service/f96832cb-7e12-47e9-89e1-07d897859a0c  running        CH-AZ  192.50.0.141  vm-e725c0e4-c9f6-42fe-9059-e29287f0a54c  large.disk  true
1 vms
 
Deployment 'service-instance_b9524c17-4129-45ff-8eb3-36c805826e15'
Instance  Process State  AZ  IPs  VM CID  VM Type  Active
0 vms
Succeeded
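As an aside, if you only want to watch the service instance being created rather than every deployment, bosh can be scoped to a single deployment with the -d flag, e.g.:

cormac@pks-cli:~$ bosh -d service-instance_b9524c17-4129-45ff-8eb3-36c805826e15 vms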

I eventually traced it to the MTU size configured in the Edge Uplink Profile. Because my edge was connected to a trunk port and not a particular VLAN, I should have been using an MTU size of 1600. Instead, I had an MTU size of 1500. The Uplink Profile is selected as part of the Edge > Transport Node > N-VDS setup. Once I corrected that misconfiguration, the K8s cluster deployment proceeded beyond this point.

Here is a view of the Uplink Profile before we made the changes, when the Kubernetes cluster deployments were failing:

[Screenshot: Edge Uplink Profile with the MTU set to 1500]

Here is how the issue was resolved:

[Screenshot: Edge Uplink Profile with the MTU changed to 1600]
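Once the MTU has been raised, one way to verify that 1600-byte frames really make it between transport nodes is a don’t-fragment ping between TEP interfaces. From an ESXi transport node, something along these lines should succeed (the remote TEP address is a placeholder; 1572 is 1600 minus 28 bytes of IP and ICMP headers):

[root@esxi:~] vmkping ++netstack=vxlan -d -s 1572 <remote-TEP-IP>

If this fails while a default-sized ping works, something in the path is still dropping the larger frames.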

I also heard from a customer who saw a similar issue when the MTU size was misconfigured on the underlying VSS standard switch or VDS distributed switch to which the Edge is connected. So take a look at your MTU sizes if you run into this issue – it may be the problem.
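On the ESXi side, the configured MTU of each standard and distributed switch can be checked from the command line; both of these standard esxcli commands print an MTU column for each switch:

[root@esxi:~] esxcli network vswitch standard list
[root@esxi:~] esxcli network vswitch dvs vmware list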

Here is an excellent description of why this requirement exists on the NSX-T Edge, from my colleague Francis Guillier:

The Edge TN (Transport Node) has 3 vNICs:

  1. vNIC for Edge Management.
  2. vNIC for Edge Uplink (connected to physical router).
  3. vNIC for VTEP (Virtual Tunnel Endpoint). This is what enables hypervisor hosts to participate in an NSX-T overlay. The VTEP is the connection point at which the encapsulation and decapsulation takes place.

The Edge Transport Node configuration in NSX-T Manager allows you to specify the uplink properties for interfaces 2 and 3 via an Uplink Profile for each interface.
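If you want to audit the configured MTU values without clicking through the UI, the NSX-T Manager REST API exposes uplink profiles (including their mtu field) under host switch profiles. Something like the following should work, with the manager address and credentials as placeholders:

curl -k -u 'admin:<password>' https://<nsx-manager>/api/v1/host-switch-profiles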

The uplink profile for the Edge Uplink interface (2) can be set to 1500 or a greater value (1600, for instance). The reason is that the Edge Uplink interface is connected to a physical router via a VLAN, and this VLAN is set to MTU=1500 by default on the physical switch. However, some organizations like to enable jumbo frames everywhere, and if this is the case, the MTU on this interface can be bumped up to 9000 bytes. The key here is that the MTU for the Edge Uplink should be set to the same value as the one set on the physical switch for the VLAN that interconnects the Edge Uplink to the physical router.

For the VTEP interface (3), NSX-T uses GENEVE encapsulation tunnels, and these require 1600 bytes. There is no choice here: the MTU must be set to a minimum of 1600 bytes, and any greater value is fine. That’s why the Uplink Profile applied to this interface should be set to 1600 bytes (or greater).
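As a rough back-of-the-envelope calculation of where the 1600-byte figure comes from (the GENEVE option length varies, so the minimum includes some headroom):

  1500 bytes  inner frame (guest at standard MTU)
  + 14 bytes  outer Ethernet header
  + 20 bytes  outer IPv4 header
  +  8 bytes  outer UDP header
  +  8 bytes  GENEVE base header
  + variable  GENEVE options (NSX-T metadata)
  = ~1550+ bytes, hence the 1600-byte minimum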
