VSAN 6.2 Part 6 – Performance Service

Many seasoned VSAN administrators will know how heavily we rely on VSAN Observer to understand the underlying performance of VSAN. While VSAN Observer is a very powerful tool, it does have some drawbacks. For one, it does not provide historic performance data; it simply gives a real-time view of the system as it is now, not as it was previously. VSAN Observer is also a separate tool that is not integrated with the vSphere web client, so you do not have a “single pane of glass” view of the system. The tool is also complex, providing a lot of engineering-level metrics that are not really customer consumable. It also has an impact on vCenter Server, as the tool is launched via RVC, the Ruby vSphere Console, and RVC typically resides on the vCenter Server. With these limitations in mind, VSAN 6.2 introduces a new service to give administrators a detailed understanding of VSAN performance without the limitations outlined here.

This new performance service offers the following:

Integrated with vSphere web client

The first point to make is that the new Performance Service is fully consumable via the vSphere web client. There is no need to log in to RVC, and there is no need to manually launch the service. Once the service has been enabled by the administrator, it is “always on”. All the graphs for the performance service can be found under the Monitor > Performance view when a cluster, host or VM is selected in the vCenter Server inventory.

Simplified Metrics

The number of metrics displayed by the performance service is significantly lower than the number displayed by VSAN Observer. However, the metrics that are now displayed are far more consumable to an administrator. When a cluster, host or VM view of performance is selected, administrators can opt to show either the VM consumption or the VSAN back-end consumption. For example, if a VM with a RAID-1 configuration is examined, you may see 500 write ops from the VM view, but since the VMDK is mirrored, there would be 1000 write ops on the back-end, 500 to each replica.
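
The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration, not how the performance service computes its counters: the function name is made up for this post, and it assumes a RAID-1 object with Number of Failures To Tolerate + 1 replicas, ignoring witness, metadata and resync traffic.

```python
def backend_write_ops(vm_write_ops: int, failures_to_tolerate: int = 1) -> int:
    """Estimate back-end write ops for a mirrored (RAID-1) object.

    Hypothetical helper for illustration only. Assumes every VM write is
    sent to each replica, and that a RAID-1 object has
    failures_to_tolerate + 1 replicas.
    """
    replicas = failures_to_tolerate + 1
    return vm_write_ops * replicas


# 500 write ops in the VM view, mirrored across two replicas
print(backend_write_ops(500))  # 1000
```

Reads are not multiplied in the same way, since a read only needs to be serviced by one replica, which is why the VM and back-end views can differ for writes but look similar for reads.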

Other metrics displayed in these views include Throughput, Latency, Congestions and Outstanding IO.

The host view also offers some additional metrics for both Disk Group and Disk. The Disk Group view has metrics such as Read Cache Hit Rate (only relevant to hybrid VSAN), but also Evictions, Write Buffer Free Percentage as well as a number of additional useful counters.

The VM view provides the administrator with metrics on a per-VMDK basis, including VSCSI IOPS, Throughput and Latency. If you are unsure about a particular metric, each one has a “tool tip” which you can click on (i for information) to get more detail on the metric and what it is collecting. AskVMware links are also provided for further information.

The metrics displayed are the ones that we feel will be used most often by VSAN administrators.

Distributed Architecture

One of the design goals of the performance service was to make sure there was no performance impact on vCenter Server. For this reason, the performance service was designed to utilize VSAN’s distributed architecture. Performance stats are stored in a stats DB, which is deployed as an object on the VSAN datastore when the performance service is enabled. The hosts in the cluster elect a master for updating the stats DB, and once one of the hosts has been elected to this role, all other ESXi hosts send their statistics to that host for persisting in the stats DB. The stats are averaged over a 5 minute period. The stats DB is then queried by the web client when the graphs are rendered in the UI.
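
To make the 5 minute averaging a little more concrete, here is a minimal Python sketch of the general idea. It is not the performance service’s actual code; the function name, the sample format (epoch seconds plus a value) and the bucketing on fixed 5 minute boundaries are all assumptions made for illustration.

```python
from collections import defaultdict

INTERVAL_SECONDS = 5 * 60  # 5 minute averaging period


def average_by_interval(samples):
    """Average (epoch_seconds, value) samples into 5 minute buckets.

    Hypothetical helper for illustration only. Returns a dict mapping
    the start of each interval to the mean of the values that fall in it.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Round the timestamp down to the start of its 5 minute interval
        start = timestamp - (timestamp % INTERVAL_SECONDS)
        buckets[start].append(value)
    return {start: sum(values) / len(values)
            for start, values in sorted(buckets.items())}


# Made-up IOPS samples: three in the first interval, two in the second
samples = [(0, 100), (60, 200), (299, 300), (300, 400), (540, 600)]
print(average_by_interval(samples))  # {0: 200.0, 300: 500.0}
```

One consequence of averaging like this is that a short spike inside an interval gets smoothed out, which is worth keeping in mind when comparing these graphs to a real-time tool.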

The other design goal was to not have any reliance on vCenter Server. Therefore, if something should happen to vCenter Server and a new vCenter Server needs to be deployed, the performance statistics continue to be captured for future reference.

No single point of failure

We mentioned already that the stats DB is created on the VSAN datastore. Therefore it may have a VM Storage Policy associated with it for high availability. When the performance service is enabled, the administrator is asked to choose a policy for the stats DB object. By default, the default VSAN policy is chosen, which provides Number of Failures To Tolerate = 1. So even if there is a host failure, the stats collection will continue.

The state of the stats DB and the performance service are also checked via the health check system, so administrators are alerted if anything is wrong with the service. Here is an example of such a failure, when a split-brain is introduced in the cluster:

[Screenshot: perf-service-health-error]

Historical Information available

The final point to make is that the new performance service overcomes a severe limitation in VSAN Observer, namely the inability to look at historic data. By default, the performance service looks at the last 1 hour of data, but you can change this of course.

If you wish to look at a specific period of time, the time range can be changed to custom, as follows:

[Screenshot: perf-service-custom]

This allows you to select a specific time range to look at.

That should hopefully provide you with a decent overview of the new performance service in VSAN 6.2. While VSAN Observer continues to be available via RVC, we feel that this new service should provide answers to the vast majority of performance queries you may have on VSAN.

Of course, if you feel something else should be included, especially a field or metric that you regularly use in VSAN Observer, please let me know and I will provide the feedback to the product managers and engineers.

15 comments
  1. Hello Cormac,

    Thank you very much for taking time to write about all these details. I have a question though, what is the default max database data retention period? I mean, how far in the past can statistics be queried?

    Kind Regards,
    Ángel

  2. My stats master has died… Is there a way to disable the service, and or set a new stats master? I cannot stop the service since it can’t find the stats master object. Funny enough the service reports all is well. “Stats master is not found in the cluster”.

        • You could try checking the status of the vsan management service on each of the hosts, and restarting it if it seems to be offline:

          [root@esxi-hp-01:/etc/init.d] ./vsanmgmtd status
          vsanperfsvc is running
          [root@esxi-hp-01:/etc/init.d] ./vsanmgmtd restart
          watchdog-vsanperfsvc: Terminating watchdog process with PID 102124
          vsanperfsvc started
          [root@esxi-hp-01:/etc/init.d]

  3. I’m having a problem getting the performance service enabled on clusters inside vCenters that are part of an Enhanced Linked Mode setup. When attempting to turn it on and select the storage policy for the database the dropdown menu for the policy is blank and never populates with any policies.

  4. Hello Cormac,
    I’m seeing an issue after applying a host profile across a three node cluster which is affecting two out of the three hosts. The VSAN health is giving me a warning stating that in the performance service there are ‘hosts not contributing stats’. I confirmed that the vsanperfsvc is running on the hosts. The only host that is reporting performance stats is the host I used to create the host profile in the first place. I also used the ESXi 6.0 hardening guidelines as part of the profile. What could be the cause of this problem?
    Thanks,
    Anthony

    • Try disabling the service through the web client, then restarting the service on the ESXi hosts, then enabling the service once more via the web client. Also make sure that there is free space on /locker on the ESXi hosts. If that doesn’t fix it, I would speak to GSS.

  5. Hi Cormac, I’m having a hard time setting up the performance service as Justin was above. The selection of the storage policy is blank, despite having policies created. Logoff/Logon, even restarting the VCSA haven’t resolved it. I’ve been talking to GSS, they are not sure yet either. Everything comes up green in the VSAN 6.2 health check. Any ideas?
    Thanks,

    Brad

    • Not sure, but it may simply be a UI issue Brad. If you go ahead and enable the service anyway, it should simply pick up the default policy, and you can verify this afterwards. Do you have issues with seeing the policies of the VMs, or selecting a policy when creating a new VM?

Comments are closed.