Read locality in VSAN stretched cluster

Many regular readers will know that we do not do read locality in Virtual SAN. For VSAN, it has always been a trade-off of networking vs. storage latency. Let me give you an example. When we deploy a virtual machine with multiple objects (e.g. VMDK), and this VMDK is mirrored across two disks on two different hosts, we read in a round-robin fashion from both copies based on the block offset. Similarly, as the number of failures to tolerate is increased, resulting in additional mirror copies, we continue to read in a round-robin fashion from each copy, again based on block offset. In fact, we don't even need the VM's compute to reside on the same host as a copy of the data. In other words, the compute could be on host 1, the first copy of the data could be on host 2, and the second copy of the data could be on host 3. Yes, I/O will have to do a single network hop, but when compared to the latency in the I/O stack itself, this is negligible. The cache associated with each copy of the data is also warmed as reads are requested. The added benefit of this approach is that vMotion operations between any of the hosts in the VSAN cluster do not impact the performance of the VM – we can migrate the VM to our heart's content and still get the same performance.
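To make the idea concrete, here is a minimal sketch (not VSAN's actual implementation; the chunk size and function names are hypothetical) of how a read can be mapped deterministically to one of the mirror copies based purely on block offset, so that the same block always hits the same replica and warms that replica's cache:

```python
# Hypothetical chunk size (in blocks) used purely for illustration.
CHUNK_BLOCKS = 2048

def pick_replica(block_offset: int, num_replicas: int) -> int:
    """Deterministically map a logical block offset to one of the
    mirror copies. Consecutive chunks rotate round-robin across the
    replicas, spreading reads over all copies of the data."""
    return (block_offset // CHUNK_BLOCKS) % num_replicas
```

Because the mapping depends only on the offset, it is stable regardless of which host the VM's compute runs on, which is why vMotion within the cluster does not disturb read performance.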

So that's how things were up until the VSAN 6.1 release. There is now a new network latency element which changes the equation when we talk about VSAN stretched clusters. The reasons for this change will become obvious shortly.

What if we made no change to read algorithm?

Let's talk first about the round-robin, block-offset algorithm in a stretched cluster environment. For the sake of argument, let's assume that read latency in a standard hybrid VSAN cluster is 2ms. That means the VM can move between any hosts in this cluster, and latency remains constant. Now we introduce the concept of a hybrid VSAN stretched cluster, where hosts that are part of the same cluster can be located miles/kilometers apart. The guidance we give our VSAN stretched cluster customers is that latency between sites should be no more than 5ms RTT. Therefore a read operation, which was previously 2ms, could now incur up to 2ms + 5ms = 7ms if it has to traverse the link between the sites. We are looking at roughly a two- to three-fold increase in average latency if the original round-robin, block-offset read algorithm were left in place.
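The back-of-the-envelope arithmetic works out as follows (the 2ms and 5ms figures are the assumed values from the text, not measurements; with one copy per site and round-robin reads, roughly half of all reads cross the link):

```python
# Assumed values from the text, not measurements.
LOCAL_READ_MS = 2.0        # read latency within one site
INTER_SITE_RTT_MS = 5.0    # maximum supported inter-site round trip

# Worst case: a read that must traverse the inter-site link.
remote_read_ms = LOCAL_READ_MS + INTER_SITE_RTT_MS        # 7.0 ms

# With round-robin over one local and one remote copy, about half
# of the reads go remote, so the average read latency becomes:
avg_round_robin_ms = (LOCAL_READ_MS + remote_read_ms) / 2  # 4.5 ms
```

So the worst-case read is 3.5x slower, and even the average sits at more than double the original 2ms, which is the "two- to three-fold" increase described above.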

A new read algorithm for VSAN 6.1

However, we have not left the behaviour like that. Instead, a new algorithm specifically for stretched clusters is introduced in VSAN 6.1. Rather than continuing to read in a round-robin, block-offset fashion, VSAN now has the smarts to figure out which site a VM is running on, and does 100% of its reads from the copy of the data on the local site. This means that no reads are done across the link during steady-state operations. It also means that all of the caching is done on the local site. This avoids incurring any additional latency, as reads do not have to traverse the inter-site link. Note that this is not read locality on a per-host basis; it is read locality on a per-site basis. On the same site, the VM's compute could be on one host while its local data object could be on another host, just the same as a standard non-stretched VSAN deployment.
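The stretched-cluster change can be sketched as follows (a minimal illustration with hypothetical names, not VSAN's actual code): instead of rotating across all copies, the read path prefers a replica on the VM's own site, falling back to a remote copy only when no local one is available.

```python
# Each replica is described by (site, host); names are illustrative.
def pick_site_local_replica(vm_site, replicas):
    """Prefer a copy of the data on the same site as the VM's compute.
    Fall back to a remote copy only if no local copy is available,
    e.g. after a failure that took out the local replica."""
    local = [r for r in replicas if r[0] == vm_site]
    return local[0] if local else replicas[0]
```

For example, with one copy on each site of a two-site cluster, a VM running on site B reads exclusively from the site-B replica, so its reads never cross the inter-site link in steady state.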

What happens on a failure?

We recommend the use of vSphere HA and VM/Host affinity rules in VSAN stretched clusters. This means that if there is a host failure on one site, vSphere HA will try to restart the VM on the same site. Thus the VM still has read locality, and access to the hot/warm cached blocks of data.

If there is a failure that impacts the local copy of the data, then the remote copy (and cache) will be accessed until a replacement copy of the data is rebuilt on the original site. During this time, the expectation is that the virtual machine may not perform optimally until the replica is rebuilt and the cache has re-warmed on the original site. Customers may decide to migrate virtual machines to the new site while the rebuild is taking place.

If there is a catastrophic failure however, and the whole of a site goes down, virtual machines on the failed site are restarted on the remaining site. In this case, reads will come from the replica copy on the remaining site (site locality once again), but cache will be cold. During this time, the expectation is that the virtual machine may not perform optimally until the cache has warmed on the new site. Remember that prior to the failure, 100% of the reads were coming from the original site’s objects, so only the cache on the original site was warm. Now that the VM has failed over to the other site, the cold cache must be warmed.

What about all-flash? No read cache, so what happens?

It is true to say that, since the capacity layer in an all-flash VSAN is also flash, there is no need to cache reads in a dedicated cache layer like we do for hybrid. However, the read locality behaviour still has a purpose, as it allows VMs to read from the local copy of the data rather than traversing the inter-site link. Of course, the same performance considerations arise if the VM has to read across the inter-site link in the event of a failure. One interesting point: if the VM is restarted on the other site by vSphere HA after a failure, there is no read cache to warm, so it will perform optimally immediately after failover.

6 comments
  1. Cormac,
    you said: “since the capacity layer in an all-flash VSAN is also flash, there is no need to cache reads in a dedicated cache layer like we do for hybrid”.
    I understand that in an all-flash (NOT stretched) config I do not need a caching layer? I can mark all my flash as a capacity layer? Is it right?

    • No – that’s not correct. There is still the concept of a cache layer and a capacity layer, both being made up of flash devices. All writes still go to the cache layer, which we recommend being of reasonably high spec and quality. Data is still moved to the capacity layer as it goes from hot->warm->cold. Reads will come from the cache layer if the block is present, or it is fetched from the capacity layer. However since it is also flash on the capacity layer, we have no reason to cache the reads. There is more information about the algorithms in the beginning of this post: http://cormachogan.com/2015/05/19/vsan-6-0-part-10-10-cache-recommendation-for-af-vsan/

  2. Not pertaining to this article, but to stretched VSAN. If the VSAN link between sites is down, why does the site that didn’t form a quorum with the witness have all its VMs restarted? The VMDKs are obviously there?

  3. So we have the data in both sites, and rules in place to make sure one copy is in Site A and one in Site B. If Site A fails completely, is there a soft rule to allow rebuilding the +1 copy of the data on the remaining site? If not, I assume you would have to turn off the rules that separate the copies to allow that to occur manually?

    • The affinity rules are only for tying the compute section of the VMs to a particular site. These are indeed soft, so that when a site fails, the VM can be started on the remaining site.

      We would not rebuild any data on site B if site A fails completely. It is only when site A recovers that we rebuild/resync data from site B (which was always running) back to site A.

      Currently there is no way to add a +1 copy to a surviving site. This is a feature we are looking at implementing going forward.
