Many seasoned VSAN administrators will know how heavily we rely on VSAN Observer to get an understanding of the underlying performance of VSAN. While VSAN Observer is a very powerful tool, it does have some drawbacks. For one, it does not provide historic performance data; it simply gives a real-time view of the current state of the system, not what it was like previously. VSAN Observer is also a separate tool that is not integrated with the vSphere web client, so you do not have a “single pane of glass” view of the system. The tool is also complex, providing a lot of engineering-level metrics that are not really customer consumable. It also has an impact on vCenter Server, as the tool is launched via RVC, the Ruby vSphere Console, and RVC typically resides on the vCenter Server. With these limitations in mind, VSAN 6.2 introduces a new service to give administrators a detailed understanding of VSAN performance without the limitations outlined here.
This new performance service offers the following:
Integrated with vSphere web client
The first point to make is that the new Performance Service is fully consumable via the vSphere web client. There is no need to log in to RVC, and there is no need to manually launch the service. Once the administrator has enabled the service, it is “always on”. All the graphs for the performance service can be found under the Monitor > Performance view when a cluster, host or VM is selected in the vCenter Server inventory.
The performance service displays significantly fewer metrics than VSAN Observer. However, the metrics that are now displayed are far more consumable to an administrator. When a cluster, host or VM view of performance is selected, administrators can opt to show either the VM consumption or the VSAN back-end consumption. For example, if you examine a VM with a RAID-1 configuration, you may see 500 write ops from the VM view, but since the VMDK is mirrored, there would be 1000 write ops on the back-end, 500 to each replica.
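To make the arithmetic in that example concrete, here is a minimal sketch. It assumes a RAID-1 object with two replicas and counts only the mirrored guest writes; the function name and the replica parameter are illustrative and are not part of any VSAN API, and real back-end counters may include other traffic as well.

```python
def backend_write_ops(vm_write_ops: int, replicas: int = 2) -> int:
    """Estimate back-end write ops for a mirrored (RAID-1) VMDK.

    Assumption: every guest write is sent once to each replica, and
    nothing else (resync traffic, for example) is counted.
    """
    if vm_write_ops < 0 or replicas < 1:
        raise ValueError("vm_write_ops must be >= 0 and replicas >= 1")
    return vm_write_ops * replicas


# The example from the text: 500 write ops seen from the VM view,
# mirrored across two replicas.
print(backend_write_ops(500))
```

With two replicas, each replica receives the same 500 writes, which is why the back-end view reports double the VM view.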
Other metrics displayed in these views include Throughput, Latency, Congestions and Outstanding IO.
The host view also offers some additional metrics for both Disk Group and Disk. The Disk Group view has metrics such as Read Cache Hit Rate (only relevant to hybrid VSAN), but also Evictions, Write Buffer Free Percentage as well as a number of additional useful counters.
The VM view provides the administrator with metrics on a per-VMDK basis, including VSCSI IOPS, Throughput and Latency. If you are unsure about a particular metric, each one has a “tool tip” which you can click on (i for information) to get more detail on the metric and what it is collecting. There are also AskVMware links provided for further information.
One of the design goals of the performance service was to make sure there was no performance impact on vCenter Server. For this reason, the performance service was designed to utilize VSAN’s distributed architecture. Performance stats are stored in a stats DB which is deployed as an object on the VSAN datastore when the performance service is enabled. The hosts in the cluster elect one host as the master for updating the stats DB, and all other ESXi hosts then send their statistics to that host for persisting in the stats DB. The stats are averaged over a 5-minute period. The stats DB is then queried by the web client when the graphs are rendered in the UI.
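As a rough illustration of what “averaged over a 5-minute period” means, the sketch below groups raw samples into 5-minute intervals and averages each one. This is not the stats DB’s actual implementation; the sample format (a timestamp in seconds plus a value) and the function name are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

INTERVAL_SECONDS = 300  # 5 minutes


def average_by_interval(samples, interval=INTERVAL_SECONDS):
    """Average (timestamp_seconds, value) samples per fixed interval.

    Returns a dict mapping the start time of each interval to the mean
    of the samples that fall inside it.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Round the timestamp down to the start of its interval.
        buckets[timestamp - timestamp % interval].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical IOPS samples: three in the first interval, one in the second.
samples = [(0, 100.0), (60, 200.0), (299, 300.0), (300, 50.0)]
print(average_by_interval(samples))
```

One consequence of averaging like this is that short spikes are smoothed out: a brief burst inside an interval only shows up as a modest rise in that interval’s average.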
The other design goal was to not have any reliance on vCenter Server. Therefore, if something should happen to vCenter Server and a new vCenter Server needs to be deployed, the performance statistics continue to be captured for future reference.
No single point of failure
We mentioned already that the stats DB is created on the VSAN datastore. Therefore it may have a VM Storage Policy associated with it for high availability. When the performance service is enabled, the administrator is asked to choose a policy for the stats DB object. By default, the default VSAN policy is chosen, which provides Number of Failures To Tolerate = 1. So even if there is a host failure, the stats collection will continue.
The state of the stats DB and the performance service are also checked via the health check system, so administrators are alerted if anything is wrong with the service. Here is an example of such a failure, when a split-brain is introduced in the cluster:
The final point to make is that the new performance service overcomes a severe limitation in VSAN Observer, namely the inability to look at historic data. By default, the performance service displays the last 1 hour of data, but you can of course change this.
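To give a feel for how much data the default view covers, here is a small sketch that selects the last hour from a set of 5-minute averages. The data layout (a dict keyed by interval start time in seconds) and the function name are hypothetical and only serve the example; they do not describe how the web client actually queries the stats DB.

```python
def select_window(averages, end, window_seconds=3600):
    """Return the entries whose interval start lies in [end - window, end)."""
    start = end - window_seconds
    return {ts: value for ts, value in averages.items() if start <= ts < end}


# Two hours of hypothetical 5-minute averages, keyed by interval start.
averages = {ts: float(ts) for ts in range(0, 7200, 300)}

last_hour = select_window(averages, end=7200)
print(len(last_hour))
```

Under these assumptions a one-hour window holds twelve 5-minute data points, and widening the window simply pulls in more of the stored intervals.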
That should hopefully provide you with a decent overview of the new performance service in VSAN 6.2. While VSAN Observer continues to be available via RVC, we feel that this new service should provide answers to the vast majority of performance queries you may have on VSAN.
Of course, if you feel something else should be included, especially a field or metric that you regularly use in VSAN Observer, please let me know and I will provide the feedback to the product managers and engineers.