Getting started with VCF Part 5 – Commission Hosts

At this stage, we’ve done quite a number of tasks related to VMware Cloud Foundation (VCF). Our management domain is up and running, and we also have the vRealize Suite of products deployed (vRealize Suite Lifecycle Manager, vRealize Log Insight, vRealize Operations Manager, and of course vRealize Automation). Our next step is to commission some new ESXi hosts so we can create our very first VI Workload Domain (WLD) which we can start using for production purposes. This post will look at the steps involved in commissioning the hosts.

Note that in this example, I am going to commission 3 ESXi hosts, and I am also going to select vSAN as my preferred storage. This will allow me to build RAID-1 protected virtual machines and objects, which places a copy of the data on the storage of two of the ESXi hosts, and a witness component on the storage of a third host. This means that even if there is a host failure in the WLD, my data and virtual machines remain available. My one recommendation for production environments is to consider an additional host for maintenance/self-healing purposes. vSAN can self-heal, meaning components that were lost during a failure can be rebuilt on the remaining hosts in the cluster (assuming there are enough resources available). So for a RAID-1 requirement, consider 4 hosts. For a RAID-5 requirement (which needs 4 hosts to implement a 3 data + 1 parity configuration), consider a fifth host for maintenance/self-healing. And finally, for RAID-6 (which needs 6 hosts: 4 data + 2 parity), consider a 7th host in the cluster for the same reasons.

Create a Network Pool

My first step is to create a network pool for the hosts that I am about to commission. The network pool needs to contain two pieces of information for these hosts:

  1. vMotion Network Information
  2. vSAN Network Information

You will need to provide information such as VLAN ID, MTU Size, default gateway and a range of IP addresses that can be used on the network segments. Here is what I created for my lab:
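
If you prefer to script this step, SDDC Manager also exposes network pools through its public API. Below is a rough sketch of what the request could look like with curl; the endpoint (POST /v1/network-pools) and field names are as I recall them from the VCF API reference, and the hostname, token, VLAN, MTU and IP values are purely illustrative, so verify the spec format against the API documentation for your VCF version.

# Illustrative sketch only - create a network pool via the SDDC Manager API
# (assumes an API token has already been obtained and stored in $TOKEN)
curl -k -X POST https://sddc-manager.rainpole.com/v1/network-pools \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "wld01-np01",
        "networks": [
          { "type": "VMOTION", "vlanId": 3001, "mtu": 9000,
            "subnet": "10.27.52.0", "mask": "255.255.255.0", "gateway": "10.27.52.1",
            "ipPools": [ { "start": "10.27.52.10", "end": "10.27.52.50" } ] },
          { "type": "VSAN", "vlanId": 3002, "mtu": 9000,
            "subnet": "10.27.53.0", "mask": "255.255.255.0", "gateway": "10.27.53.1",
            "ipPools": [ { "start": "10.27.53.10", "end": "10.27.53.50" } ] }
        ]
      }'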

Verify host prerequisites

There are a lot of prerequisites to meet before you can commission your ESXi hosts. Firstly, these hosts need to have 2 x 10Gb NICs available, and one of these must be attached to a standard vSwitch before commissioning. You must also ensure that the correct version of ESXi software is installed on each host. Obviously the availability of DNS, and the ability to do forward and reverse lookups of each host, is a must. You should also have SSH enabled on the hosts.

Note that if you plan to use this host for vSAN, you need to ensure that all partitions are removed from the devices that you plan to use for vSAN cache and capacity.
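
If you want to sanity check a host against these prerequisites before commissioning it, most of them can be verified from the ESXi shell over SSH. A minimal sketch (the host name and IP address below are from my lab, so adjust accordingly):

# ESXi version and build installed on the host
vmware -vl

# Physical NICs - confirm two 10Gb links are present
esxcli network nic list

# Standard vSwitch configuration - check vSwitch0 and its uplink
esxcli network vswitch standard list

# Forward and reverse DNS lookups for the host
nslookup esxi-dell-g.rainpole.com
nslookup 10.27.51.7

# Disks intended for vSAN - look for devices flagged with "Has partitions"
vdq -q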

Here is the full list of pre-reqs for an ESXi host to be commissioned:

Begin Commissioning Hosts

Once all the prerequisites have been met, we can begin to commission the hosts. In the VMware Cloud Foundation SDDC Manager UI, select Inventory > Hosts. In the top right hand corner, there is a button called “Commission Hosts”. Click this button to begin adding hosts for validation before commissioning. From here, host FQDNs are populated, their storage type is selected, and usernames and passwords are added. At this point, the network pool created earlier is also selected. In the example below, I had already validated the first host (esxi-dell-g.rainpole.com) on its own, which is why it appears as valid. I have now added two additional hosts and selected them, and I am ready to do a “Validate All” on the three hosts.

You will have to confirm the fingerprint of each host before validating. This is done by clicking the grey arrow next to each fingerprint so that it turns green; I have already done this step above. You can then select all of the hosts and click the “Validate All” button (which changes from grey to blue once the fingerprints have been confirmed). The validation status will return either “Valid” or “Invalid”. Unfortunately the UI does not tell you the reason for a validation failing; I will show you how you can determine the reason shortly.

In my lab, all three hosts have now validated successfully, so we can proceed with commissioning the hosts by clicking “Next”. The last step is to review the host information and click the “Commission” button.
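
As an aside, host validation and commissioning can also be driven through the SDDC Manager public API rather than the UI. The sketch below is illustrative only: the endpoints (POST /v1/hosts/validations and POST /v1/hosts) and spec fields are as I recall them from the VCF API reference, the credentials and network pool id are placeholders, and the exact format should be checked against the documentation for your VCF version.

# Illustrative sketch only - validate and then commission a host via the API
cat > hostspec.json <<'EOF'
[ {
    "fqdn": "esxi-dell-g.rainpole.com",
    "username": "root",
    "password": "<esxi-root-password>",
    "storageType": "VSAN",
    "networkPoolId": "<network-pool-id>"
} ]
EOF

# Validate the host spec first
curl -k -X POST https://sddc-manager.rainpole.com/v1/hosts/validations \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d @hostspec.json

# Commission the host once the validation reports it as valid
curl -k -X POST https://sddc-manager.rainpole.com/v1/hosts \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d @hostspec.json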

You can track the progress of the commission task in the SDDC Manager Hosts list (navigate to Inventory > Hosts). The tasks appear in the lower half of the screen. Click on a task to see its sub-tasks; this view can then be enlarged to see the status of all the sub-tasks associated with commissioning the hosts, as shown below:

Hosts should begin to appear with a Configuration Status of Activating:

If no further issues are encountered, the hosts should get successfully commissioned and show a Configuration Status of Active:

Where to look if validate fails?

As mentioned earlier, if the validate step fails, SDDC Manager won't show you the reason for the failure in the UI. Instead, to determine what caused a validation to fail, you will need to SSH to the SDDC Manager and monitor the operationsmanager.log file found in /var/log/vmware/vcf/operationsmanager. There can be multiple reasons for a validation to fail. Here are a few which I caught in my lab environment:

1. DNS not working

Here is an example where DNS resolution wasn't working correctly.

2020-01-14T15:26:35.148+0000 DEBUG [de268ca85f284847,f5a7]\
 [c.v.v.v.p.s.VSpherePluginService,om-exec-22] Host name esxi-dell-g . isHybrid false

2020-01-14T15:26:35.148+0000 WARN  [de268ca85f284847,f5a7]\
 [c.v.v.v.c.h.i.HttpConfigurationCompilerBase$ConnectionMonitorThreadBase,om-exec-22] Shutting down\
 the connection monitor.

2020-01-14T15:26:35.148+0000 ERROR [de268ca85f284847,f5a7]\
 [c.v.v.h.c.s.i.CommissionHostsNetworkValidator,om-exec-22]Host name esxi-dell-g is not same as FQDN\
 esxi-dell-g.rainpole.com for the host 10.27.51.7

2020-01-14T15:26:35.148+0000 ERROR [de268ca85f284847,f5a7]\
 [c.v.v.h.c.s.i.CommissionHostsValidator,om-exec-22] Host validation failed for Host:\
 esxi-dell-g.rainpole.com

2020-01-14T15:26:35.148+0000 WARN  [0000000000000000,0000]\
 [c.v.v.v.c.h.i.HttpConfigurationCompilerBase$ConnectionMonitorThreadBase,VLSI-client-connection-monitor-2444]\
 Interrupted, no more connection pool cleanups will be performed.

2020-01-14T15:26:35.148+0000 DEBUG [de268ca85f284847,6973]\
 [c.v.v.h.c.s.i.CommissionHostsValidator,om-exec-18]esxi-dell-g.rainpole.com: HOST_NAME_NOT_VALID

2020-01-14T15:26:35.148+0000 DEBUG [de268ca85f284847,6973]\
 [c.v.v.h.c.s.i.CommissionHostsValidator,om-exec-18] Completed validating Host(s).
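
In this case the error pointed at a mismatch between the host's short name and the FQDN used for commissioning. The DNS client configuration and hostname can be checked and corrected directly on the ESXi host, along these lines (the FQDN below is from my lab):

# Check the DNS servers and search domains configured on the host
esxcli network ip dns server list
esxcli network ip dns search list

# Check the hostname/FQDN currently set on the host
esxcli system hostname get

# Set the FQDN so that it matches the DNS record used for commissioning
esxcli system hostname set --fqdn=esxi-dell-g.rainpole.com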

2. Network Uplink/Speed/vSwitch Configuration

This is the log output when the host was attached to a 1Gb NIC and not a 10Gb NIC:

2020-01-14T15:32:43.744+0000 DEBUG [de268ca85f284847,0dbf]\
 [c.v.v.h.c.s.i.CommissionHostsHardwareValidator,om-exec-16] In isNicSwitchUplink, vSwitch: 'vSwitch0',\
 Uplinks: '[vmnic2, vmnic3]'.

2020-01-14T15:32:43.749+0000 DEBUG [de268ca85f284847,0dbf]\
 [c.v.v.h.c.s.i.CommissionHostsHardwareValidator,om-exec-16] In isNicSwitchUplink, vSwitch: 'vSwitch0',\
 Uplinks: '[vmnic2, vmnic3]'.

2020-01-14T15:32:43.759+0000 ERROR [de268ca85f284847,0dbf]\
 [c.v.v.h.c.s.i.CommissionHostsHardwareValidator,om-exec-16]esxi-dell-g.rainpole.com: vSphere standard\
 switch vSwitch0's Uplinks: []. vSwitch0 must have only one NIC as Uplink.

2020-01-14T15:32:43.759+0000 ERROR [de268ca85f284847,0dbf]\
 [c.v.v.h.c.s.i.CommissionHostsHardwareValidator,om-exec-16] Hardware validation for host\
 esxi-dell-g.rainpole.com failed.

2020-01-14T15:32:43.759+0000 ERROR [de268ca85f284847,0dbf]\
 [c.v.v.h.c.s.i.CommissionHostsValidator,om-exec-16] Host validation failed for Host:\
 esxi-dell-g.rainpole.com

2020-01-14T15:32:43.759+0000 DEBUG [de268ca85f284847,4fcf]\
 [c.v.v.h.c.s.i.CommissionHostsValidator,om-exec-7] esxi-dell-g.rainpole.com:\
HOST_STANDARD_SWITCH_VALIDATION_FAILED

2020-01-14T15:32:43.759+0000 DEBUG [de268ca85f284847,4fcf]\
 [c.v.v.h.c.s.i.CommissionHostsValidator,om-exec-7] Completed validating Host(s).
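
To get past this, vSwitch0 needs a single 10Gb NIC attached as its uplink. Something along these lines from the ESXi shell will show the link speeds and let you adjust the uplinks (the vmnic names below are simply the ones from my lab):

# Check the link speed of each physical NIC - the commissioning uplink must be 10Gb
esxcli network nic list

# Remove any unwanted uplinks from vSwitch0 ...
esxcli network vswitch standard uplink remove --uplink-name=vmnic3 --vswitch-name=vSwitch0

# ... leaving a single 10Gb NIC attached as the uplink
esxcli network vswitch standard uplink add --uplink-name=vmnic2 --vswitch-name=vSwitch0

# Confirm the resulting vSwitch0 configuration
esxcli network vswitch standard list -v vSwitch0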

3. No eligible vSAN disks

This is what was observed during host validation when the partition tables had not been removed from the disks that I wanted to use for vSAN. I connected to each host using the host client and manually removed the partitions from each device (I'll show a command-line alternative after the log output below). The validation task then continued past this point.

2020-01-21T09:22:06.034+0000 DEBUG [a543b3db4ddf4cdb,e864]\
 [c.v.evo.sddc.common.util.SshUtil,om-exec-14] End of execution of command [vdq -q], Status: 0

Output: [
   {
      "Name"     : "naa.500a07510f86d693",
      "VSANUUID" : "525127ad-f017-4d9a-a767-a12ad97a4bca",
      "State"    : "Ineligible for use by VSAN",
      "Reason"   : "Not mounted on this host",
      "IsSSD"    : "1",
"IsCapacityFlash": "1",
      "IsPDL"    : "0",
      "Size(MB)" : "763097",
   "FormatType" : "512e",
   },

   {
      "Name"     : "naa.500a07510f86d69d",
      "VSANUUID" : "5229da32-dbe4-3f15-b5c2-8721c2c07aa7",
      "State"    : "Ineligible for use by VSAN",
      "Reason"   : "Not mounted on this host",
      "IsSSD"    : "1",
"IsCapacityFlash": "1",
      "IsPDL"    : "0",
      "Size(MB)" : "763097",
   "FormatType" : "512e",
   },

   {
      "Name"     : "naa.5001e82002675164",
      "VSANUUID" : "52e65b4a-123f-4771-6377-60008b7724c3",
      "State"    : "Ineligible for use by VSAN",
      "Reason"   : "Not mounted on this host",
      "IsSSD"    : "1",
"IsCapacityFlash": "0",
      "IsPDL"    : "0",
      "Size(MB)" : "190782",
   "FormatType" : "512n",
   },

   {
      "Name"     : "mpx.vmhba32:C0:T0:L0",
      "VSANUUID" : "",
      "State"    : "Ineligible for use by VSAN",
      "Reason"   : "Has partitions",
      "IsSSD"    : "0",
"IsCapacityFlash": "0",
      "IsPDL"    : "0",
      "Size(MB)" : "15280",
   "FormatType" : "512n",
   },

   {
      "Name"     : "naa.624a9370d4d78052ea564a7e00011522",
      "VSANUUID" : "",
      "State"    : "Ineligible for use by VSAN",
      "Reason"   : "Has partitions",
      "IsSSD"    : "1",
"IsCapacityFlash": "0",
      "IsPDL"    : "0",
      "Size(MB)" : "1",
   "FormatType" : "512n",
   },

]
Error output:
Command timed out: false

2020-01-21T09:22:08.413+0000 ERROR [0000000000000000,0000]\
 [c.v.e.s.c.c.v.vsan.VsanManagerBase,pool-75-thread-1]Unable to get vSAN disk\
 mapping for host MOR host-73
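
If you would rather clear the leftover partitions from the command line instead of using the host client, partedUtil on the ESXi host can do it. A rough sketch, using one of the device names that vdq flagged above (double-check you have the right device before deleting anything):

# List the partition table on the device reported with "Has partitions"
partedUtil getptbl /vmfs/devices/disks/naa.624a9370d4d78052ea564a7e00011522

# Delete each partition by its partition number (repeat for every partition listed)
partedUtil delete /vmfs/devices/disks/naa.624a9370d4d78052ea564a7e00011522 1

# Re-run vdq to confirm the device is no longer flagged
vdq -q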

4. ESXi host with VIBs

In this example, the ESXi host was previously part of an NSX-T Transport Zone, so it still had a bunch of NSX-T 2.3.1 VIBs installed. These NSX-T VIBs need to be removed for the host validation to succeed; I'll show one way to do that after the log output below.

2020-01-21T11:59:15.970+0000 ERROR [a543b3db4ddf4cdb,000e]\
 [c.v.e.s.c.f.util.ValidationUtilImpl,om-exec-16] Found vibs [nsx-aggservice, nsx-cli-libs,\
 nsx-common-libs, nsx-da, nsx-esx-datapath, nsx-exporter, nsx-host, nsx-metrics-libs, nsx-mpa,\
 nsx-nestdb-libs, nsx-nestdb, nsx-netcpa, nsx-opsagent, nsx-platform-client, nsx-profiling-libs,\
 nsx-proxy, nsx-python-gevent, nsx-python-greenlet, nsx-python-logging, nsx-python-protobuf,\
 nsx-rpc-libs, nsx-sfhc, nsx-shared-libs, nsxcli] on host 10.27.51.7

2020-01-21T11:59:15.970+0000 ERROR [a543b3db4ddf4cdb,000e]\
 [c.v.v.h.c.s.i.CommissionHostsVibValidationService,om-exec-16] Host esxi-dell-g.rainpole.com,\
 has disallowed vibs - [nsx-aggservice, nsx-cli-libs, nsx-common-libs, nsx-da, nsx-esx-datapath,\
 nsx-exporter, nsx-host, nsx-metrics-libs, nsx-mpa, nsx-nestdb-libs, nsx-nestdb, nsx-netcpa,\
 nsx-opsagent, nsx-platform-client, nsx-profiling-libs, nsx-proxy, nsx-python-gevent,\
 nsx-python-greenlet, nsx-python-logging, nsx-python-protobuf, nsx-rpc-libs, nsx-sfhc,\
 nsx-shared-libs, nsxcli], validation failed.

2020-01-21T11:59:15.970+0000 ERROR [a543b3db4ddf4cdb,000e]\
 [c.v.v.h.c.s.i.CommissionHostsValidator,om-exec-16] Host validation failed for Host:\
 esxi-dell-g.rainpole.com

2020-01-21T11:59:15.970+0000 DEBUG [a543b3db4ddf4cdb,7f8e]\
 [c.v.v.h.c.s.i.CommissionHostsValidator,om-exec-3]esxi-dell-g.rainpole.com: HOST_VIB_VALIDATION_FAILED
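
To clean up a host in this state, the leftover NSX-T VIBs can be listed and removed from the ESXi shell. A sketch along these lines (the VIB name shown is one of those reported in the log; repeat the remove for each one, and note that VIB dependencies may force a particular removal order and a reboot may be needed afterwards):

# List the NSX VIBs still installed on the host
esxcli software vib list | grep -i nsx

# Remove each leftover VIB by name
esxcli software vib remove --vibname=nsx-opsagent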

Step 5: Where to look if commission fails?

Even though all the validation checks passed, the domaincontroller.log is still a good place to keep an eye on as the commissioning progresses.

5.1 Hosts in Maintenance Mode

I hit a snag during the commissioning when I found that my ESXi hosts were still in Maintenance Mode from previous activity, even though they had passed the validation checks. This is what was reported in the logs:

2020-01-21T12:34:03.082+0000 DEBUG [a543b3db4ddf4cdb,07ab]\
 [c.v.e.s.v.p.a.HostMaintenanceModeValidationAction,pool-1-thread-14] Checking Maintenance Mode for host :\
 [esxi-dell-l.rainpole.com, 10.27.51.124]

2020-01-21T12:34:03.082+0000 INFO  [a543b3db4ddf4cdb,07ab]\
 [c.v.e.s.c.c.v.vsphere.VcManagerBase,pool-1-thread-14] Retrieving ManagedObjectReference for host\
 esxi-dell-l.rainpole.com

2020-01-21T12:34:03.311+0000 DEBUG [a543b3db4ddf4cdb,07ab]\
 [c.v.e.s.c.c.v.v.InventoryService,pool-1-thread-14] No more results to retrieve

2020-01-21T12:34:03.311+0000 INFO  [a543b3db4ddf4cdb,07ab]\
 [c.v.e.s.c.c.v.vsphere.VcManagerBase,pool-1-thread-14] ManagedObjectReference of host\
 esxi-dell-l.rainpole.com is ha-host

2020-01-21T12:34:03.323+0000 ERROR [a543b3db4ddf4cdb,07ab]\
 [c.v.e.s.v.p.a.HostMaintenanceModeValidationAction,pool-1-thread-14] Host : [esxi-dell-l.rainpole.com,\
 10.27.51.124] is in MAINTENANCE mode.

2020-01-21T12:34:03.323+0000 WARN  [a543b3db4ddf4cdb,07ab]\
 [c.v.v.v.c.h.i.HttpConfigurationCompilerBase$ConnectionMonitorThreadBase,pool-1-thread-14]\
 Shutting down the connection monitor.

2020-01-21T12:34:03.323+0000 WARN  [0000000000000000,0000]\
 [c.v.v.v.c.h.i.HttpConfigurationCompilerBase$ConnectionMonitorThreadBase,VLSI-client-connection-monitor-964]\
 Interrupted, no more connection pool cleanups will be performed.

2020-01-21T12:34:03.324+0000 ERROR [a543b3db4ddf4cdb,07ab]\
 [c.v.e.s.o.model.error.ErrorFactory,pool-1-thread-14] [7FI79R]HOST_IN_MAINTENANCE_MODE Host :\
 [esxi-dell-l.rainpole.com, 10.27.51.124] has been found in MAINTENANCE mode.

com.vmware.evo.sddc.orchestrator.exceptions.OrchTaskException: Host : [esxi-dell-l.rainpole.com,\
 10.27.51.124] has been found in MAINTENANCE mode.

Taking the hosts out of maintenance mode allowed the commissioning to succeed.
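
For reference, the maintenance mode state can also be checked and cleared from the ESXi shell, along these lines:

# Check whether the host is currently in maintenance mode
esxcli system maintenanceMode get

# Take the host out of maintenance mode
esxcli system maintenanceMode set --enable false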

And now that we have managed to commission some hosts, we are ready to build our first VMware Cloud Foundation (VCF) VI Workload Domain (WLD). I will cover that in my next post. All of my VMware Cloud Foundation posts can be found here.