Tips for a successful Virtual SAN (VSAN) Proof Of Concept (POC)

Pretty soon I’ll be heading out on the road to talk at various VMUGs about our first 6 months with VSAN, VMware’s Virtual SAN product. Regular readers will need no introduction to VSAN, and as was mentioned at VMworld this year, we’re gearing up for our next major release. With that in mind, I thought it might be useful to go back over the last 6 months, with a look at some successes, some design decisions you might have to make, the available troubleshooting tools, some common gotchas (all those things that will help you have a successful Proof of Concept – POC – with VSAN) and then a quick look at some futures.

What is VSAN (brief version)?

VMware Virtual SAN can be considered a software-based storage solution built into VMware ESXi and managed by vCenter Server. It aggregates local flash and magnetic disks to create a single shared datastore for virtual machine storage. In a nutshell, it provides a converged compute + storage solution. Because of its distributed architecture, there is no single point of failure. It is deeply integrated with the VMware stack, meaning that no agents, VIBs or appliances need to be deployed to use it. You simply need a license to build your VSAN cluster.

To date, VMware has over 500 VSAN customers, and VSAN has won various awards, including Best of TechEd 2014 and Best of Interop 2014. You can read more about the awards here.

VSAN Design – Successful Proof-of-Concepts

Some lessons have been learnt from customers doing Proof Of Concepts with VSAN over the past 6 months. We have received quite a number of questions on how things are supposed to work, what tests should be done, and how to examine the running state of VSAN. There are a number of considerations to be taken into account if you are going to evaluate VSAN successfully. Item number 1 on the list is to ensure that you use components (Storage I/O Controllers, Flash Devices/SSDs) that are qualified for use with VSAN and appear on the VSAN Compatibility Guide. We have had numerous issues with customers using non-qualified controllers and consumer grade SSD (low Outstanding I/O, low concurrency/parallelism) that simply don’t cut it when customers move their production workloads to VSAN. This leads to bad experiences all round. Go with enterprise class SSD/flash every time.

One other thing to keep in mind is that the inbox drivers shipped with OEM ISO images from our partners do not necessarily contain the Storage I/O Controller drivers supported by VSAN. You may need to downgrade the drivers if you use these ISO images. I wrote about this scenario extensively in this post.

Flash capacity sizing is also critical. The design goal should be for your application’s working set to be mostly in cache. VMs running on VSAN will send all their writes to cache, and will always try to read from cache. A read cache miss will result in data being fetched from the magnetic disk layer, thus increasing latency. This is an important design goal, and VMware recommends flash be sized at 10% of consumed VMDK size (excluding any failures to tolerate settings/mirror copies), but you may want to use more depending on your workloads. This value represents a typical working set. Keep in mind that as you scale up by adding new, larger magnetic disks for capacity, and add more virtual machine workloads, you will have to increase your flash capacity too, so plan ahead. It might be good to start with a larger flash size if you do plan on scaling your workloads on VSAN over time.
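
As a rough illustration of the 10% guideline, here is a minimal Python sketch. The function names, the even split across hosts and the example figures (100 VMs consuming 200 GB each on 4 hosts) are assumptions made for illustration only; they are not a VMware sizing tool.

```python
def recommended_flash_gb(consumed_vmdk_gb, flash_percent=10):
    """Flash capacity (GB) as a percentage of consumed VMDK capacity.

    consumed_vmdk_gb: anticipated consumed VMDK space, counted once,
    i.e. before any failures-to-tolerate mirror copies are added.
    flash_percent: 10 reflects the guideline; raise it for larger working sets.
    """
    if consumed_vmdk_gb < 0 or flash_percent <= 0:
        raise ValueError("capacity must be >= 0 and percentage must be > 0")
    return consumed_vmdk_gb * flash_percent / 100


def flash_per_host_gb(consumed_vmdk_gb, host_count, flash_percent=10):
    """Spread the cluster-wide flash requirement evenly across the hosts."""
    if host_count < 1:
        raise ValueError("host_count must be at least 1")
    return recommended_flash_gb(consumed_vmdk_gb, flash_percent) / host_count


# Hypothetical example: 100 VMs, each consuming 200 GB, on a 4 host cluster.
consumed_gb = 100 * 200
cluster_flash_gb = recommended_flash_gb(consumed_gb)
host_flash_gb = flash_per_host_gb(consumed_gb, host_count=4)
print(cluster_flash_gb, host_flash_gb)
```

Re-running the calculation with the consumed capacity you expect after growth, rather than on day one, is one way to act on the "plan ahead" advice.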

You should also be aware that read cache misses, which will undoubtedly happen from time to time, mean that VSAN will have to go to the magnetic disk layer for data. If your magnetic disk layer is made up of slow drives, then this will add to the latency. Also, if you only have a single magnetic disk on your host, all cache misses will be directed to this single drive. If you have many magnetic disks at your disposal, you could also consider adding a stripe width to your storage policy. All of these should be taken into account.

An interesting point relates to how you can test for failures in a VSAN PoC. SSD and disk failures report errors immediately, and remediation then starts to occur (rebuilding of components, etc). However host and network failures are different. With a host or network failure, it might be that it comes back online in a short space of time (maybe the host has been rebooted for example). For that reason, there is a 60 minute timer (default setting) that needs to expire before remediation action will commence and components on the unreachable host are rebuilt on the remaining hosts in the cluster.

Another important factor is how to simulate a drive failure during a PoC. You should understand that pulling a drive may or may not simulate a disk failure. If a disk is pulled, VSAN will mark it as “Absent”, as the disk might be re-inserted. If the 60 minute timer expires, the disk is treated the same as “Degraded” and rebuilding of components will occur. A disk failure and a disk pull may be two completely different things: an actual disk failure will start remediation action immediately. It might be worth considering using 3rd party tools from HP, DELL or LSI to offline a drive, to see if that simulates a failure more faithfully than a disk pull.

An important consideration is an SSD failure. It is handled the same way as a magnetic disk failure, but an SSD failure impacts the whole disk group, and will involve rebuilding all the components in that disk group.
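
The difference between failures that trigger an immediate rebuild and those that wait for the timer can be summarised in a small Python sketch. This is only a model of the behaviour described above, not VSAN code: the event names are invented for illustration, and treating host and network failures as “absent” is a modelling assumption based on the fact that they wait for the same 60 minute timer.

```python
REPAIR_DELAY_MINUTES = 60  # default timer described in the text

# Hypothetical event names mapped to the state used in this model.
# "degraded" -> rebuild starts immediately
# "absent"   -> rebuild starts only once the timer has expired
EVENT_STATE = {
    "disk_failure": "degraded",
    "ssd_failure": "degraded",      # affects every component in the disk group
    "disk_pull": "absent",
    "host_failure": "absent",       # assumption: modelled like a pulled disk
    "network_failure": "absent",    # assumption: modelled like a pulled disk
}


def rebuild_should_start(event, minutes_elapsed, delay=REPAIR_DELAY_MINUTES):
    """Return True if component rebuild would have begun by minutes_elapsed."""
    state = EVENT_STATE[event]  # raises KeyError for unknown events
    if state == "degraded":
        return True
    return minutes_elapsed >= delay


for event in ("disk_failure", "disk_pull", "host_failure"):
    print(event, rebuild_should_start(event, minutes_elapsed=30))
```

In PoC terms: after pulling a disk or powering off a host, do not expect to see resync activity until the timer has expired.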

We also get asked what to do with the cache on the controllers. If possible, disable the Storage I/O Controller cache. If that is not possible, set the cache to 100% read. VSAN provides a write buffer in flash for caching I/O. We don’t need to cache it again; it won’t help.

Another design goal is related to number of hosts in the cluster. Although we state 3 hosts at a minimum, should one of these hosts fail, your VMs are going to be running unprotected until you address the issue at hand. Remember that a protected VM object will have 3 components; first copy of the data, second copy of the data and a witness. This is why we need a minimum of 3 hosts – a component will reside on each of the hosts. A second failure before the first failure has been remediated will result in a production down. A VM object cannot remain accessible if more than 50% of its components are unavailable. In a double failure like this, you will have lost 66% of the components. But you can protect against this. One consideration is to use a 4 node minimum configuration, just like EVO:RAIL. In this configuration, if a host fails, then there is a way for VSAN to rebuild the components from the failed host on the remaining hosts in the cluster (if there is enough free capacity). It is strongly recommended, unless you plan to put a 3 node configuration into production, to do your PoC testing with 4 nodes.
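
The 50% rule is easy to check by hand, but a short Python sketch makes the double-failure case explicit. The host names and the component layout are hypothetical, and the function simply encodes the rule as stated above (an object becomes inaccessible once more than 50% of its components are unavailable); it is not how VSAN itself is implemented.

```python
def object_accessible(component_hosts, failed_hosts):
    """Apply the rule above: inaccessible if > 50% of components are unavailable.

    component_hosts: mapping of component name -> host it resides on.
    failed_hosts: set of host names that are currently down.
    """
    unavailable = sum(1 for host in component_hosts.values() if host in failed_hosts)
    return unavailable * 2 <= len(component_hosts)


# Hypothetical layout for one protected object on a 3 node cluster.
layout = {"replica-1": "host-a", "replica-2": "host-b", "witness": "host-c"}

print(object_accessible(layout, set()))                 # no failures
print(object_accessible(layout, {"host-a"}))            # one host down
print(object_accessible(layout, {"host-a", "host-b"}))  # two hosts down
```

With three components, one failure leaves two of three available and the object stays accessible; a second failure before remediation leaves only one of three, which is the production-down case described above.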

Finally, consider disk groups. Disk groups are containers which contain a flash device and up to 7 magnetic disks. However, hosts can have multiple disk groups. So what is the difference between having a single disk group and multiple disk groups? Well, if an SSD fails, it impacts the whole of the disk group and all components will need to be evacuated. If you have a 3 node cluster and a single disk group per host, then there is nowhere for the data in the failed disk group to be resynced/reconfigured to if you have a policy setting of number of failures to tolerate = 1 (remember this needs 3 hosts to each have a component, as discussed previously). If you had multiple disk groups on each host, then data in the failed disk group could be synced to the working disk group(s) on the same host, providing additional resilience. In other words, without the second disk group on the same host, a 3 node configuration would be unable to sync the data from the failed disk group to another host, and is thus left in a situation where another failure could lead to a production down scenario, like we discussed previously. Having a second disk group on each node of a 3 node VSAN cluster can provide additional resilience to certain failures.

These are some of the common design considerations for a VSAN PoC.

VSAN Troubleshooting

There are a bunch of different tools available for troubleshooting VSAN. These range from the vSphere Web Client, ESXCLI, Ruby vSphere Console (RVC) and VSAN Observer to third-party tools for looking at Storage I/O Controllers and Flash Devices. HP provide various tools (e.g. hpssacli) for examining and configuring HP Smart Array devices, while the MegaCLI tool can be used on various LSI and DELL Storage I/O Controllers. Fusion-IO also provide various CLI tools for examining their PCI-E flash devices. However, the most useful of all these tools, in my opinion, is VMware’s own RVC. This is a command line tool shipped with vCenter Server (both Linux and Windows versions). Here is a list of RVC commands related to VSAN taken from a 5.5U2 vCenter Server Appliance.

/localhost/ie-datacenter-01/computers/ie-vsan-01/hosts> vsan.<tab><tab>
vsan.apply_license_to_cluster         vsan.host_info
vsan.check_limits                     vsan.host_wipe_vsan_disks
vsan.check_state                      vsan.lldpnetmap
vsan.clear_disks_cache                vsan.obj_status_report
vsan.cluster_change_autoclaim         vsan.object_info
vsan.cluster_info                     vsan.object_reconfigure
vsan.cluster_set_default_policy       vsan.observer
vsan.cmmds_find                       vsan.reapply_vsan_vmknic_config
vsan.disable_vsan_on_cluster          vsan.recover_spbm
vsan.disk_object_info                 vsan.resync_dashboard
vsan.disks_info                       vsan.support_information
vsan.disks_stats                      vsan.vm_object_info
vsan.enable_vsan_on_cluster           vsan.vm_perf_stats
vsan.enter_maintenance_mode           vsan.vmdk_stats
vsan.fix_renamed_vms                  vsan.whatif_host_failures
vsan.host_consume_disks
/localhost/ie-datacenter-01/computers/ie-vsan-01/hosts>

I use many of these commands all the time, simply to health check my VSAN cluster. You will want to use commands such as vsan.resync_dashboard and vsan.disks_stats to ensure that, once a failure has been resolved, VSAN has completed remediation before going on to the next test. Remember that VMs deployed on VSAN are typically configured to tolerate one failure, so make sure that the first failure is resolved before testing something else. Otherwise you will introduce multiple failures which could leave your VMs inaccessible.

Some of the other useful ones, in my opinion, are vsan.check_state, vsan.check_limits, vsan.whatif_host_failures. The first two just tell me that everything is within thresholds and healthy. The whatif command is useful for determining if I have enough resources available in the cluster for a component rebuild in the event of a failure.
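
To make the idea behind a “what if” check concrete, here is a deliberately simplified Python sketch. It is not how vsan.whatif_host_failures works internally; it only asks two questions drawn from the discussion above: are at least 3 hosts left to hold the components of a failures to tolerate = 1 object, and is there enough free capacity on the surviving hosts to absorb what the failed host held? The host names and capacity figures are hypothetical.

```python
def can_rebuild_after_host_failure(hosts, failed_host, min_hosts=3):
    """Rough check of whether components from a failed host could be rebuilt.

    hosts: mapping of host name -> (used_gb, capacity_gb).
    min_hosts: hosts needed to place two copies plus a witness (FTT = 1).
    """
    if failed_host not in hosts:
        raise KeyError(failed_host)
    survivors = {name: v for name, v in hosts.items() if name != failed_host}
    if len(survivors) < min_hosts:
        return False  # e.g. a 3 node cluster that has lost one host
    used_on_failed, _capacity = hosts[failed_host]
    free_on_survivors = sum(cap - used for used, cap in survivors.values())
    return free_on_survivors >= used_on_failed


# Hypothetical clusters: every host has 1000 GB of capacity, 400 GB used.
four_nodes = {f"esx-0{i}": (400, 1000) for i in range(1, 5)}
three_nodes = {f"esx-0{i}": (400, 1000) for i in range(1, 4)}

print(can_rebuild_after_host_failure(four_nodes, "esx-01"))
print(can_rebuild_after_host_failure(three_nodes, "esx-01"))
```

The sketch ignores per-disk placement and component size limits, so treat the RVC command as the authority and this only as an explanation of what it is looking for.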

For deeper troubleshooting, RVC commands such as vsan.disks_info, vsan.disks_stats and vsan.object_info provide deep dive information about the VSAN status. These give deep insight into whether or not there are any underlying storage issues, or indeed virtual machine object/component issues.

For performance monitoring and troubleshooting, vsan.observer is the go-to tool. When launched, it provides a web-based tool for monitoring and troubleshooting VSAN performance. It provides information such as latency, IOPS, bandwidth, Outstanding I/O, Read Cache Hit Rate, Write Buffer Evictions, Congestion, etc. Basically, it provides all the information you would need to investigate VSAN performance. This is a sample screenshot of what the interface looks like:

[Screenshot: VSAN Observer - VSAN client]

Something we heard a lot from our customers is that they no longer want storage to be a “black box”. VSAN Observer, originally designed as an engineering tool but made available to our customers, allows you to have deep insight into low-level VSAN performance and behaviors.

One thing to keep in mind is that the cache will need to warm when workloads are first deployed on VSAN. Therefore it is not uncommon to see spikes in some of the VSAN observer metrics initially, but these settle down over time as the working sets get cached.

I would urge all VSAN customers to become familiar with the VSAN RVC extensions and VSAN Observer as these will allow you to understand how your VSAN environment looks in steady state, and will also make it easier to spot anomalies should they arise.

VSAN Gotchas

I wanted to add a section to this review to talk about some common gotchas when customers try to do Proof Of Concepts with VSAN. Some of the major gotchas are to do with networking. VSAN requires multicast traffic, and without it, the cluster will not form – it reports misconfiguration detected. Not only that, but each node on VSAN must be able to communicate with every other node over a dedicated VMkernel network tagged for VSAN traffic. This means that any VLAN, MTU size and IP configuration must be consistent across all hosts, and each host should be able to vmkping every other host in the cluster on that VMkernel interface. Here is a very interesting case study regarding an MTU size mismatch on the VSAN network.
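
One way to make the “every host can vmkping every other host” check systematic is to generate the full list of commands up front. The Python sketch below only builds the command strings; it does not run anything. The host names, IP addresses and the vmk2 interface name are hypothetical, and it assumes the VSAN VMkernel interface has the same name on every host. The payload size is the MTU minus 28 bytes (20 byte IPv4 header plus 8 byte ICMP header), and -d sets the don't-fragment bit so that an MTU mismatch shows up as a failed ping.

```python
from itertools import permutations

ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def vmkping_commands(vsan_ips, vmk="vmk2", mtu=1500):
    """Build (source_host, command) pairs covering every ordered host pair.

    vsan_ips: mapping of host name -> IP of its VSAN VMkernel interface.
    vmk: VMkernel interface tagged for VSAN traffic (assumed identical on all hosts).
    mtu: MTU configured end to end on the VSAN network.
    """
    payload = mtu - ICMP_OVERHEAD
    commands = []
    for src, dst in permutations(sorted(vsan_ips), 2):
        commands.append((src, f"vmkping -I {vmk} -d -s {payload} {vsan_ips[dst]}"))
    return commands


# Hypothetical 3 node cluster using jumbo frames on the VSAN network.
cluster = {
    "esx-01": "172.16.10.11",
    "esx-02": "172.16.10.12",
    "esx-03": "172.16.10.13",
}

for host, command in vmkping_commands(cluster, mtu=9000):
    print(f"[run on {host}] {command}")
```

For a 3 node cluster this produces six commands, one per direction per host pair; a single failing direction is enough to point at a VLAN, MTU or IP misconfiguration on that path.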

Other gotchas relate to VSAN not claiming disks, either because there is existing partition information on the disks, or because the Storage I/O Controller doesn’t support pass-thru so the ESXi host is not able to see the disks. In the latter case, a RAID-0 volume must be built on each disk to make them visible to ESXi, and thus visible to VSAN. Another issue is that the Storage I/O Controller allows the disks to be shared with another host, so you need to manually build the disk groups by placing the cluster in manual mode. VSAN won’t automatically claim a disk if it can be shared with another host – it will only automatically claim disks that are marked as local to a single ESXi host. However, you can manually claim them. These gotchas are also discussed in a previous post.

There is also a gotcha currently with VSAN Observer and Outstanding I/O (OIO). This is not specific to VSAN Observer, but for those of you involved in VSAN PoCs, this is where it will manifest itself. The OIO values/calculations are not accurate in 5.5U2, and you can read more about that here.

One final note is related to the “default” policy, which has led to a number of queries from customers. If you do not set a policy on a VM being deployed on VSAN, it picks up a default policy which has failures to tolerate set to 1. This means you get availability for your VMs. However, the disks that get deployed will be “thick” by default, and not thin, which is what you get when you have any sort of policy created. This catches a lot of customers out as they see consumed space growing rapidly with this default policy.

VSAN Futures

A number of VSAN futures were hinted at VMworld this year, and have been widely publicized in the press. In no particular order, this is a snippet of some of the items we are planning for a future release of VSAN:

  • All Flash Virtual SAN configurations. The thought is that we use high performance flash devices for the caching layer and then moderate performance flash devices for the persistent layer, although exact configurations are still being discussed internally.
  • A while back, some of you might remember that VMware purchased a software company called Virsto. The plan is to now include the Virsto file-system into VSAN which will enable VSAN to provide a number of additional enterprise class services such as fast cloning as well as provide a new, efficient snapshot mechanism.
  • We also mentioned that we are aiming for 64 node VSAN clusters, allowing even greater scale-out.

Interesting times ahead for sure.

Conclusion

Yes, there are a lot of considerations if you want to have a successful PoC with VSAN. We are working diligently to make sure this becomes easier, and most of all we want you to be successful if you try it out. The good news is that many of the folks who already have VSAN in production are very pleased with many aspects of VSAN, including performance, reliability and its ease of use.

If you are at one of the EMEA VMUGs next week (UK, Denmark or Belgium) and want to discuss VSAN in further detail, please reach out. I’d be happy to talk with you.

11 comments
  1. Hi Cormac,

    Really useful post.

    I am keen to get your thoughts on using GbE or 10GbE with VSAN, all the materials state that both are fully supported but 10GbE is preferred.

    The problem is that there is no explanation for when you should not use GbE – if you are only doing a few 1,000 IOPS per node I would have thought there would be little value in moving to 10GbE, on the other hand if you need 10,000s of IOPS per node then you would have to use 10GbE.

    With regard to the road-map I would have thought that the most important feature is to enable the ability to support multiple SSDs per disk group.

    This would address two significant current problems:

    1. Failure of an entire disk group when a single SSD fails
    2. Scaling the performance of a disk group

    Support for a two-node cluster (for small remote offices) would also seem desirable.

    What are your thoughts – does this make sense?

    Best regards
    Mark

    • These all make perfect sense – thanks Mark.

      On the 1Gb versus 10Gb, our only recommendation is to dedicate a 1Gb NIC to VSAN traffic if that is what you are using.

      If using 10Gb, you can share the NIC with other traffic types, but we’d suggest using distributed switches with NIOC, and ensure that there is QoS for the VSAN traffic, which means that operations like vMotion don’t impact VSAN.

      I can’t comment on VSAN futures that haven’t already been discussed openly at events such as VMworld, but I’ll share your thoughts with the extended VSAN team. Thanks for taking the time to post the comment.

  2. Hi Cormac.
    I set up VSAN, but after it had run for about 2 to 3 days it reported an error:
    [IMG]http://pik.vn/2014af267fc8-df51-4898-b57e-98f8f603f769.png[/IMG]
    – I set up VSAN with 3 nodes, using 10Gb NICs.
    – Per node: 1 x Intel SSD 240, 2 x WDRE 2TB.
    – One VM. I have only installed the OS, nothing else.

    • The degraded state associated with the witness suggests that the physical disk on which it resides is errored.

      Please check that all your components (servers, controllers, SSDs, magnetic disks, driver versions and firmware versions) are on the VMware Compatibility Guide for VSAN.

      A better place to seek advice like this is the VMware Community forum for Virtual SAN.

  3. You make a good case for deploying a four node VSAN cluster and I agree with your conclusions.

    Small businesses wishing to use VSAN likely are limited to Essentials Plus. I am currently deploying a three node cluster each having a single processor as the only cost effective VSAN solution.

    What is the likelihood of better aligning the licensing requirements of these two products?

    Can you clarify this statement? It appears on page 61 of your book, and seems to differ from the statement in this blog post.

    “All objects deployed on VSAN are thinly provisioned. This means that no space is reserved at VM deployment time but rather space is consumed as the size of the VM storage grows.

    Hogan, Cormac; Epping, Duncan (2014-07-09). Essential Virtual SAN (VSAN): Administrator’s Guide to VMware Virtual SAN (VMware Press Technology) (p. 61). Pearson Education. Kindle Edition.”

    In this blog post..

    “One final note is related to the “default” policy which has led to a number of queries from customers/ If you do not set a policy on a VM being deployed on VSAN, it picks up a default policy which has failures to tolerate set to 1. This means you get availability for your VMs. However, the disks that get deployed will be “thick” by default, and not thin, which is what you get when you have any sort of policy created. This catches a lot of customers out as they see consumed space growing rapidly with this default policy.”

    • Sure.

      If you deploy with a policy, VM is thin.

      If you deploy without choosing a policy, a default is chosen, and this allows you to pick thin, lazy zero thick and eager zero thick. Lazy is the default format in this case.
