Pretty soon I’ll be heading out on the road to talk at various VMUGs about our first 6 months with VSAN, VMware’s Virtual SAN product. Regular readers will need no introduction to VSAN, and as was mentioned at VMworld this year, we’re gearing up for our next major release. With that in mind, I thought it might be useful to look back over the last 6 months: some successes, some design decisions you might have to make, the available troubleshooting tools, some common gotchas (all those things that will help you have a successful Proof of Concept – POC – with VSAN) and then a quick view of some futures.
What is VSAN (brief version)?
VMware Virtual SAN can be considered a software-based storage solution built into VMware ESXi and managed by vCenter Server. It aggregates local Flash and magnetic disks to create a single shared datastore for virtual machine storage. In a nutshell, it provides a converged compute + storage solution. Because of its distributed architecture, there is no single point of failure. It is deeply integrated with the VMware stack, meaning that no agents, VIBs or appliances need to be deployed to use it. You simply need a license to build your VSAN cluster.
To date, VMware has over 500 VSAN customers, and the product has won various awards, including Best of TechEd 2014 and Best of Interop 2014. You can read more about the awards here.
VSAN Design – Successful Proof-of-Concepts
Some lessons have been learnt from customers doing Proof Of Concepts with VSAN over the past 6 months. We have received quite a number of questions on how things are supposed to work, what tests should be done, and how to examine the running state of VSAN. There are a number of considerations to be taken into account if you are going to evaluate VSAN successfully. Item number 1 on the list is to ensure that you use components (Storage I/O Controllers, Flash Devices/SSDs) that are qualified for use with VSAN and appear on the VSAN Compatibility Guide. We have had numerous issues with customers using non-qualified controllers and consumer grade SSDs (low Outstanding I/O, low concurrency/parallelism) that simply don’t cut it when customers move their production workloads to VSAN. This leads to bad experiences all round. Go with enterprise class SSD/flash every time.
One other thing to keep in mind is that the inbox drivers shipped with OEM ISO images from our partners do not necessarily contain the Storage I/O Controller drivers supported by VSAN. You may need to downgrade the drivers if you use these ISO images. I wrote about this scenario extensively in this post.
Flash capacity sizing is also critical. The design goal should be for your application’s working set to be mostly in cache. VMs running on VSAN will send all their writes to cache, and will always try to read from cache. A read cache miss will result in data being fetched from the magnetic disk layer, thus increasing latency. This is an important design goal, and VMware recommends flash be sized at 10% of consumed VMDK size (excluding any failures to tolerate settings/mirror copies), but you may want to use more depending on your workloads. This value represents a typical working set. Keep in mind that as you scale up by adding new, larger magnetic disks for capacity, and add more virtual machine workloads, you will have to increase your flash capacity too, so plan ahead. It might be good to start with a larger flash size if you do plan on scaling your workloads on VSAN over time.
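To make the 10% rule of thumb concrete, here is a back-of-the-envelope sizing calculation. The VM counts and sizes below are purely illustrative numbers from a hypothetical cluster, not a recommendation:

```shell
# Rough flash sizing per the 10% rule of thumb.
# Example: 100 VMs, each with a 50 GB VMDK that is ~40% consumed.
VM_COUNT=100
VMDK_GB=50
CONSUMED_PCT=40

# Consumed capacity, excluding mirror copies from failures-to-tolerate
CONSUMED_GB=$(( VM_COUNT * VMDK_GB * CONSUMED_PCT / 100 ))

# Recommended flash: 10% of consumed capacity, cluster-wide
FLASH_GB=$(( CONSUMED_GB * 10 / 100 ))

echo "Consumed capacity: ${CONSUMED_GB} GB"
echo "Recommended flash (cluster-wide): ${FLASH_GB} GB"
```

Remember this is a starting point for a typical working set; if your workloads are read-miss heavy or you expect to scale, size the flash layer up accordingly.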
You should also be aware that read cache misses, which will undoubtedly happen from time to time, mean that VSAN will have to go to the magnetic disk layer for data. If your magnetic disk layer is made up of slow drives, then this will add to the latency. Also, if you only have a single magnetic disk on your host, all cache misses will be directed to this single drive. If you have many magnetic disks at your disposal, you could also consider adding a stripe width to your storage policy. All of these factors should be taken into account.
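If you do decide to test a stripe width, it can be applied via a VM Storage Policy in the Web Client, or as a cluster default via RVC. The policy string below is a sketch from my own lab notes – verify the exact syntax against the RVC documentation for your release:

```shell
# From RVC, navigated to the cluster object: set a default policy with
# failures to tolerate = 1 and a stripe width of 2 (syntax is an assumption
# based on 5.5-era RVC; double check on your build)
vsan.cluster_set_default_policy . '(("hostFailuresToTolerate" i1) ("stripeWidth" i2))'
```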
An interesting point relates to how you can test for failures in a VSAN PoC. SSD and disk failures report errors immediately, and remediation then starts to occur (rebuilding of components, etc.). However, host and network failures are different. With a host or network failure, it might be that the host comes back online in a short space of time (perhaps it was simply rebooted). For that reason, there is a 60 minute timer (the default setting) that needs to expire before remediation commences and components on the unreachable host are rebuilt on the remaining hosts in the cluster.
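The 60 minute default is controlled by an ESXi advanced setting. If you want to shorten it for failure testing in a PoC (I would not change it in production without talking to support first), this is the one to look at – the option name below is from my 5.5U2 lab, so double check it on your build:

```shell
# View the current repair delay (in minutes) on an ESXi host
esxcli system settings advanced list -o /VSAN/ClomRepairDelay

# Reduce it to 10 minutes for PoC failure testing
# (assumption: option name as seen on 5.5U2)
esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 10
```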
Another important factor is how to simulate a drive failure during a PoC. You should understand that pulling a drive may, or may not, simulate a disk failure. If a disk is pulled, VSAN will mark it as “Absent”, since the disk might be re-inserted. Once the 60 minute timer expires, the disk is treated the same as “Degraded” and rebuilding of components will occur. A disk failure and a disk pull are therefore two quite different things: an actual disk failure will start remediation immediately. It might be worth considering using 3rd party tools from HP, DELL or LSI to offline a drive to simulate a failure, rather than pulling a disk.
Another important consideration is an SSD failure. It behaves the same way as a magnetic disk failure, but an SSD failure impacts the whole disk group, and will involve rebuilding all of the components in that disk group.
We also get asked what to do with the cache on the controllers. If possible, disable the Storage I/O Controller cache. If that is not possible, set the cache to 100% read. VSAN provides a write buffer in flash for caching I/O; we don’t need to cache it again, and it won’t help.
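On HP Smart Array controllers, for example, this can be done with the hpssacli tool. Treat the commands below as a sketch from my lab notes – the slot number is an example, and you should confirm the exact syntax for your controller and tool version:

```shell
# Show current cache settings for the controller in slot 0 (example slot)
hpssacli controller slot=0 show detail

# Shift the controller cache ratio to 100% read / 0% write,
# since VSAN already provides its own write buffer in flash
hpssacli controller slot=0 modify cacheratio=100/0
```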
Another design goal is related to the number of hosts in the cluster. Although we state 3 hosts at a minimum, should one of these hosts fail, your VMs are going to be running unprotected until you address the issue at hand. Remember that a protected VM object will have 3 components: a first copy of the data, a second copy of the data and a witness. This is why we need a minimum of 3 hosts – a component will reside on each of the hosts. A second failure before the first failure has been remediated will result in a production-down scenario. A VM object requires more than 50% of its components to be available in order to remain accessible; in a double failure like this, you will have lost 66% of the components. But you can protect against this. One consideration is to use a 4 node minimum configuration, just like EVO:RAIL. In this configuration, if a host fails, then there is a way for VSAN to rebuild the components from the failed host on the remaining hosts in the cluster (if there is enough free capacity). It is strongly recommended, unless you plan to put a 3 node configuration into production, to do your PoC testing with 4 nodes.
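The arithmetic behind the 3 host minimum can be sketched out as follows – it is just the component counting described above, written down:

```shell
# Component math for a VM object with FailuresToTolerate (FTT) = 1
FTT=1
DATA_COPIES=$(( FTT + 1 ))                 # 2 mirror copies of the data
WITNESSES=1                                # tie-breaker witness component
COMPONENTS=$(( DATA_COPIES + WITNESSES ))  # 3 components, one per host

# An object needs MORE than 50% of its components available to stay accessible
AFTER_ONE_FAILURE=$(( COMPONENTS - 1 ))    # 2 of 3 available -> still accessible
AFTER_TWO_FAILURES=$(( COMPONENTS - 2 ))   # 1 of 3 available -> inaccessible

echo "components=${COMPONENTS} after_two_failures=${AFTER_TWO_FAILURES}"
```

With a 4th host in the cluster, the components from a failed host have somewhere to be rebuilt, restoring full protection before a second failure can do damage.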
Finally, consider disk groups. A disk group is a container comprising one flash device and up to 7 magnetic disks, and hosts can have multiple disk groups. So what is the difference between having a single disk group and multiple disk groups? Well, if an SSD fails, it impacts the whole disk group and all components will need to be evacuated. If you have a 3 node cluster with a single disk group per host, then there is nowhere for the data in the failed disk group to be resynced/reconfigured to if you have a policy setting of number of failures to tolerate = 1 (remember this needs 3 hosts to each hold a component, as discussed previously). If you had multiple disk groups on each host, then data in the failed disk group could be synced to the working disk group(s) on the same host, providing additional resilience. In other words, without the second disk group on the same host, a 3 node configuration would be unable to resync the data from the failed disk group anywhere, and is thus left in a situation where another failure could lead to a production-down scenario, as we discussed previously. Having a second disk group on each node of a 3 node VSAN cluster can provide additional resilience to certain failures.
These are some of the common design consideration for a VSAN PoC.
There are a bunch of different tools available for troubleshooting VSAN. These range from the vSphere Web Client, ESXCLI, Ruby vSphere Console (RVC) and VSAN Observer to third-party tools for looking at Storage I/O Controllers and Flash Devices. HP provide various tools (e.g. hpssacli) for examining and configuring HP Smart Array devices, while the MegaCLI tool can be used on various LSI and DELL Storage I/O Controllers. Fusion-io also provide various CLI tools for examining their PCI-E flash devices. However, the most useful of all these tools, in my opinion, is VMware’s own RVC. This is a command line tool shipped with vCenter Server (both Linux and Windows versions). Here is a list of RVC commands related to VSAN taken from a 5.5U2 vCenter Server Appliance.
/localhost/ie-datacenter-01/computers/ie-vsan-01/hosts> vsan.<tab><tab>
vsan.apply_license_to_cluster
vsan.check_limits
vsan.check_state
vsan.clear_disks_cache
vsan.cluster_change_autoclaim
vsan.cluster_info
vsan.cluster_set_default_policy
vsan.cmmds_find
vsan.disable_vsan_on_cluster
vsan.disk_object_info
vsan.disks_info
vsan.disks_stats
vsan.enable_vsan_on_cluster
vsan.enter_maintenance_mode
vsan.fix_renamed_vms
vsan.host_consume_disks
vsan.host_info
vsan.host_wipe_vsan_disks
vsan.lldpnetmap
vsan.obj_status_report
vsan.object_info
vsan.object_reconfigure
vsan.observer
vsan.reapply_vsan_vmknic_config
vsan.recover_spbm
vsan.resync_dashboard
vsan.support_information
vsan.vm_object_info
vsan.vm_perf_stats
vsan.vmdk_stats
vsan.whatif_host_failures
I use many of these commands all the time, simply to health check my VSAN cluster. You will want to use commands such as vsan.resync_dashboard and vsan.disks_stats to ensure that, once a failure has been resolved, VSAN has completed remediation before going on to the next test. Remember that VMs deployed on VSAN are typically configured to tolerate one failure, so make sure that the first failure is resolved before testing something else. Otherwise you will introduce multiple failures, which could leave your VMs inaccessible.
Some of the other useful ones, in my opinion, are vsan.check_state, vsan.check_limits and vsan.whatif_host_failures. The first two just tell me that everything is within thresholds and healthy. The whatif command is useful for determining whether I have enough resources available in the cluster for a component rebuild in the event of a failure.
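Put together, a quick health-check pass in RVC might look like this. All of these commands appear in the listing above; the `.` refers to the cluster object you have navigated to, and the cluster path is from my lab:

```shell
# From the RVC prompt, after navigating to the cluster object, e.g.
# cd /localhost/ie-datacenter-01/computers/ie-vsan-01
vsan.check_state .            # are all objects/components healthy?
vsan.check_limits .           # are we within component and disk limits?
vsan.resync_dashboard .       # any resync/rebuild traffic still outstanding?
vsan.disks_stats .            # per-disk usage and health
vsan.whatif_host_failures .   # enough capacity to rebuild after a host failure?
```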
For deeper troubleshooting, RVC commands such as vsan.disks_info, vsan.disks_stats and vsan.object_info provide deep dive information about the VSAN status. These give deep insight into whether or not there are any underlying storage issues, or indeed virtual machine object/component issues.
For performance monitoring and troubleshooting, vsan.observer is the go-to tool. When launched, it provides a web-based interface for monitoring and troubleshooting VSAN performance, with information such as latency, IOPS, bandwidth, Outstanding I/O, Read Cache Hit Rate, Write Buffer Evictions, Congestion, etc. Basically, it provides all the information you would need to investigate VSAN performance.
Something we heard a lot from our customers is that they no longer want storage to be a “black box”. VSAN Observer, originally designed as an engineering tool but made available to our customers, allows you to have deep insight into low-level VSAN performance and behaviors.
One thing to keep in mind is that the cache will need to warm when workloads are first deployed on VSAN. Therefore it is not uncommon to see spikes in some of the VSAN observer metrics initially, but these settle down over time as the working sets get cached.
I would urge all VSAN customers to become familiar with the VSAN RVC extensions and VSAN Observer, as these will allow you to understand how your VSAN environment looks in steady state, and will also make it easier to spot anomalies should they arise.
I wanted to add a section to this review to talk about some common gotchas when customers try to do Proof Of Concepts with VSAN. Some of the major gotchas are to do with networking. VSAN requires multicast traffic, and without it, the cluster will not form – “Misconfiguration detected” is reported. Not only that, but each node on VSAN must be able to communicate with every other node over a dedicated VMkernel network tagged for VSAN traffic. This means that the VLAN, MTU size and IP configuration must be consistent across all hosts, and each host should be able to vmkping every other host in the cluster on that VMkernel interface. Here is a very interesting case study regarding an MTU size mismatch on the VSAN network.
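On each host, the connectivity and MTU checks can be done from the ESXi shell. The vmk interface name and IP address below are examples from a hypothetical lab setup; substitute your own:

```shell
# Confirm which VMkernel interface is configured (and tagged) for VSAN traffic
esxcli network ip interface list

# Basic reachability to another VSAN node over the VSAN vmknic
vmkping -I vmk2 172.16.1.102

# With jumbo frames, verify the MTU end to end:
# 9000 bytes minus IP/ICMP overhead = 8972-byte payload, don't-fragment set
vmkping -I vmk2 -d -s 8972 172.16.1.102
```

If the large, don't-fragment ping fails while the basic ping works, you almost certainly have an MTU mismatch somewhere in the path.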
Other gotchas relate to VSAN not claiming disks, either because there is existing partition information on the disks, or because the Storage I/O Controller doesn’t support pass-thru, so the ESXi host is not able to see the disks. In the latter case, a RAID-0 volume must be built on each disk to make it visible to ESXi, and thus visible to VSAN. Another issue arises when the Storage I/O Controller allows the disks to be shared with another host; in that case you need to build the disk groups by hand after placing the cluster in manual mode. VSAN won’t automatically claim a disk if it can be shared with another host – it will only automatically claim disks that are marked as local to a single ESXi host. However, you can manually claim them. These gotchas are also discussed in a previous post.
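Stale partitions can be inspected and removed from the ESXi shell with partedUtil. The device name below is a placeholder (naa.xxxx) – be very careful that you are targeting the right disk before deleting anything:

```shell
# List storage devices to find the device identifier of the disk in question
esxcli storage core device list | grep -i naa

# Show the existing partition table on the disk (naa.xxxx is a placeholder)
partedUtil getptbl /vmfs/devices/disks/naa.xxxx

# Delete partition number 1, leaving the disk blank for VSAN to claim
partedUtil delete /vmfs/devices/disks/naa.xxxx 1
```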
There is also a gotcha currently with VSAN Observer and Outstanding I/O (OIO). This is not specific to VSAN Observer, but for those of you involved in VSAN PoCs, this is where it will manifest itself. The OIO values/calculations are not accurate in 5.5U2, and you can read more about that here.
One final note relates to the “default” policy, which has led to a number of queries from customers. If you do not set a policy on a VM being deployed on VSAN, it picks up a default policy with failures to tolerate set to 1. This means you get availability for your VMs. However, the disks that get deployed will be “thick” by default, and not thin, which is what you get when you create any sort of policy. This catches a lot of customers out, as they see consumed space growing rapidly with this default policy.
A number of VSAN futures were hinted at during VMworld this year, and have been widely publicized in the press. In no particular order, this is a snippet of some of the items we are planning for a future release of VSAN:
- All Flash Virtual SAN configurations. The thought is that we use high performance flash devices for the caching layer and then moderate performance flash devices for the persistent layer, although exact configurations are still being discussed internally.
- A while back, some of you might remember that VMware purchased a software company called Virsto. The plan is to now include the Virsto file-system into VSAN which will enable VSAN to provide a number of additional enterprise class services such as fast cloning as well as provide a new, efficient snapshot mechanism.
- We also mentioned that we are aiming for 64 node VSAN clusters, allowing even greater scale-out.
Interesting times ahead for sure.
Yes, there are a lot of considerations if you want to have a successful PoC with VSAN. We are working diligently to make sure this becomes easier, and most of all we want you to be successful if you try it out. The good news is that many of the folks who already have VSAN in production are very pleased with many aspects of VSAN, including performance, reliability and its ease of use.
If you are at one of the EMEA VMUGs next week (UK, Denmark or Belgium) and want to discuss VSAN in further detail, please reach out. I’d be happy to talk with you.