VASA – Error while removing an entry for the provider in the providerCache

Some time ago, I hooked our Virtual Volume (VVol) capable storage array up to a vSphere 6.5 cluster. I did a few preliminary tests, such as adding the VASA (vSphere APIs for Storage Awareness) Provider, creating the VVol datastore, observing the creation of the Protocol Endpoints (PEs), and of course creating some virtual machines as a set of VVols. All went flawlessly. Then, like many other lab tests, I got distracted by a number of other things. In the meantime, the storage array vendor announced support for an updated VASA version, and we had the storage array updated to accommodate this. I picked another newly created cluster running vSphere 6.7, added the VASA Provider, VVol datastore, PEs and everything looked great. After a while, I went back to the vSphere 6.5 cluster, and noticed that my VVol datastore was inaccessible, and my VP registration was offline. I since learnt that in order to support multiple vCenter Servers registering to the same VASA Provider on a VVol array, an external Certificate of Authority is needed and that all vCenters much register with the same cert. Otherwise, as was in my case, an internal vCenter certificates is used. Then as a different vCenter tries to register to the VP, they  step on each others certs. Once I learnt this, I decided to try to get my 6.5 environment reconnected. On trying to remove the VP via the vSphere Client, I hit an error, and was directed to look in the logs for more details. This is what i wanted to talk about in this post.

First, I found this error in the sps.log file which is located in /storage/log/vmware/vmware-sps on the vCenter appliance:

2018-08-02T13:50:47.967Z [pool-22-thread-10] DEBUG opId=StorageProvidersViewMediator-remove-329899-ngc:70036368 com.vmware.vim.sms.task.TaskImpl - Created a task - ManagedObjectReference: type = SmsTask, value = SmsTask-22, serverGuid = 126f5b2d-ae91-4173-bdb6-1fbefcc7e6e2
2018-08-02T13:50:47.967Z [pool-14-thread-2] DEBUG opId= com.vmware.vim.sms.StorageManagerImpl - unregisterProvider() called for provider - 453d453a-ac19-4bf5-b1f7-7252ad84950a
2018-08-02T13:50:47.968Z [pool-14-thread-2] DEBUG opId= com.vmware.vim.sms.task.TaskImpl - Updated progress to 10% for task ManagedObjectReference: type = SmsTask, value = SmsTask-22, serverGuid = 126f5b2d-ae91-4173-bdb6-1fbefcc7e6e2
2018-08-02T13:50:47.968Z [pool-14-thread-2] DEBUG opId= com.vmware.vim.sms.provider.ProviderFactory - [removeProvider] Removing provider - 453d453a-ac19-4bf5-b1f7-7252ad84950a
2018-08-02T13:50:47.968Z [pool-14-thread-2] DEBUG opId= com.vmware.vim.sms.provider.ProviderCache - [removeEntry] Remove the cache entry for the provider - 453d453a-ac19-4bf5-b1f7-7252ad84950a
2018-08-02T13:50:47.968Z [pool-14-thread-2] DEBUG opId= com.vmware.vim.sms.provider.ProviderCache - [removeFdMappingForProvider] No FaultDomainMapping objects found for given fualtDomainId list []
2018-08-02T13:50:47.968Z [pool-14-thread-2] DEBUG opId= com.vmware.vim.sms.provider.failover.ProviderFailoverHandler - [removeProvider] Removing provider: 453d453a-ac19-4bf5-b1f7-7252ad84950a from priority cache
2018-08-02T13:50:47.968Z [pool-14-thread-2] DEBUG opId= com.vmware.vim.sms.provider.failover.ProviderFailoverEngine - [FAILOVER] Provider failover started: [provider: 453d453a-ac19-4bf5-b1f7-7252ad84950a; array: com.purestorage:d4d78052-ea56-4a7e-9df9-67d1a1634faf]
2018-08-02T13:50:47.968Z [pool-14-thread-2] DEBUG opId= com.vmware.vim.sms.provider.failover.ProviderFailoverEngine - [getBackupProvider] Looking for the new backup provider...
2018-08-02T13:50:47.968Z [pool-14-thread-2] DEBUG opId= com.vmware.vim.sms.provider.failover.ProviderFailoverEngine - Check backup provider b1801691-98a2-4484-b156-f91e55ca54c6 ...
2018-08-02T13:50:47.969Z [pool-14-thread-2] DEBUG opId= com.vmware.vim.sms.provider.failover.ProviderFailoverEngine - NeedsExplicitActivation is not set. Skipping activating provider...
2018-08-02T13:50:47.969Z [pool-14-thread-2] DEBUG opId= com.vmware.vim.sms.provider.failover.ProviderFailoverEngine - [getBackupProvider] Found new backup provider: b1801691-98a2-4484-b156-f91e55ca54c6
2018-08-02T13:50:47.969Z [pool-14-thread-2] DEBUG opId= com.vmware.vim.sms.provider.failover.ProviderFailoverEngine - [updateActiveProvider] Flush event queue for the provider.
2018-08-02T13:50:47.969Z [pool-14-thread-2] ERROR opId= com.vmware.vim.sms.provider.ProviderCache - Error while removing an entry for the provider in the providerCache - java.lang.NullPointerException
        at com.vmware.vim.sms.provider.vasa.VasaProviderImpl.flushEventQueue(VasaProviderImpl.java:1562)
        at com.vmware.vim.sms.provider.failover.ProviderFailoverEngine.removeOldActiveProvider(ProviderFailoverEngine.java:296)
        at com.vmware.vim.sms.provider.failover.ProviderFailoverEngine.failover(ProviderFailoverEngine.java:126)
        at com.vmware.vim.sms.provider.failover.ProviderFailoverEngine.failover(ProviderFailoverEngine.java:73)
        at com.vmware.vim.sms.provider.failover.ProviderFailoverHandler.removeProvider(ProviderFailoverHandler.java:119)
        at com.vmware.vim.sms.provider.ProviderCache.removeEntry(ProviderCache.java:359)
        at com.vmware.vim.sms.provider.ProviderFactory.removeVasaProvider(ProviderFactory.java:422)
        at com.vmware.vim.sms.provider.ProviderFactory.removeProvider(ProviderFactory.java:336)
        at com.vmware.vim.sms.StorageManagerImpl.unregisterProviderInt(StorageManagerImpl.java:578)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.vmware.vim.sms.task.JobHandler.run(JobHandler.java:70)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2018-08-02T13:50:47.969Z [pool-14-thread-2] DEBUG opId= com.vmware.vim.sms.task.TaskManagerImpl - [markTaskCompleted] Marked SmsTask-22 as complete

Fortunately, I found a procedure that, as far as I am aware, is not available publicly. These are the steps I followed to remove my original vCenter registration and resolve the situation where I appear to have a ‘stale’ VP present. I’m assuming this can happen for a number of reasons, which is why I decided to post the steps here. My best guess in this case is that my vCenter was known, but no longer trusted (or something to that effect).

Step 1.

SSH to the vCenter Appliance and login as root.

Step 2.

Modify /usr/lib/vmware-vpx/sps/conf/sms-config.xml file to change retryAttempt parameter from 99999 to 4:

<!-- Retry configuration for QS connection -->
       <qsConnection>
          <!-- Retry attempt for QS connection. -->
          <retryAttempt>4</retryAttempt> <--Here
          <!-- Retry interval (in seconds) for QS connection. -->
          <retryInterval>30</retryInterval>
      </qsConnection>

Step 3.

On the vCenter server, block access to the IP address of VP that you cannot remove via static routes. When the SPS services comes up next time, vCenter will mark the VP as Offline as it can no longer reach it.

  # route add -host <StorageProviderIP> reject

Step 4.

On the vCenter server, stop the profile driven storage service (SPS)

  # service-control --stop vmware-sps

Step 5.

On the vCenter server, restart the profile driven storage service (this also initiates the VASA providers)

 # service-control --start vmware-sps

Step 6.

Wait for the SPS service to fully start. This can be checked in the vSphere UI, under Administration > Deployment > System Configuration > Services.

Step 7.

Try again to deleted the problematic VASA storage provider entry via the vSphere client UI. Since it is now offline, it should be removable. This now worked for me.

Step 8.

On the vCenter server, unblock the route to the VASA storage provider that we previously blocked.

 # route del -host <StorageProviderIP> reject

Step 9.

Modify /usr/lib/vmware-vpx/sps/conf/sms-config.xml file to put back the retryAttempt parameter to 99999 from 4:

<!-- Retry configuration for QS connection -->
       <qsConnection>
          <!-- Retry attempt for QS connection. -->
          <retryAttempt>99999</retryAttempt> <--Here
          <!-- Retry interval (in seconds) for QS connection. -->
          <retryInterval>30</retryInterval>
      </qsConnection>

Step 10.

On the vCenter server, stop and start the profile driven storage service (SPS) once more.

 # service-control --stop vmware-sps
 # service-control --start vmware-sps

Step 11.

Wait for the service to fully restart. Now add back the VASA Storage Provider via the UI. This should now work as well (it did for me). Note that this issue should be addressed in some upcoming updates to vSphere, and that these steps should no longer be necessary.