VSAN resync behaviour when failed component recovers

I have had this question a number of times now. Those of you familiar with VSAN will know that if a component goes absent for a period of 60 minutes (the default), VSAN will begin rebuilding a new copy of the component elsewhere in the cluster (if resources allow it). The question then is: if the missing/absent/failed component recovers and becomes visible to VSAN once again, what happens? Do we throw away the component that was just created, or do we throw away the original component that recovered?

[Updated 29-Oct-2015] When I first published this article, the VSAN engineering team reached out to say that what I described was not 100% accurate. What I described was a future plan for the resync process. They then described to me how the process actually works.

  1. If a component goes absent, we will wait 60 minutes (the default grace period) before creating a new mirror for that component. If the absent component comes back within the 60 min grace period, VSAN will not create any new components but it may resync the recovered component if it’s not up-to-date.
  2. If the component has been absent for more than 60 minutes, VSAN will “fix” it by creating a new mirror of this component. In the meantime, VSAN marks the old absent one as “transient” (meaning we will get rid of it later). After the new component is fully resynced (so we have availability compliance), VSAN removes the old (absent) transient component.
  3. If the old absent component comes back during the resyncing period for the new component, VSAN will resync the recovered component as well and bring it back to an active state, but this old component will eventually be removed (even though it’s fully resynced) once the newly created component has been fully resynced.
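The three steps above amount to a simple decision procedure, which could be sketched roughly as follows. This is a hypothetical model for illustration only: `REPAIR_DELAY_MIN` mirrors the default 60-minute grace period, but the function and action names are my own, not VSAN's actual code.

```python
# Illustrative model of the rebuild decision, not actual VSAN code.
REPAIR_DELAY_MIN = 60  # default grace period before a rebuild starts

def rebuild_decision(absent_minutes, recovered):
    """Return the (hypothetical) actions VSAN takes for an absent component."""
    if absent_minutes <= REPAIR_DELAY_MIN:
        # Step 1: within the grace period no new mirror is created;
        # a recovered component is resynced if it is stale.
        return ["resync recovered component"] if recovered else ["wait"]
    # Step 2: past the grace period, build a new mirror and mark the
    # old absent component as transient.
    actions = ["create new mirror", "mark absent component transient"]
    if recovered:
        # Step 3: a component that returns mid-rebuild is resynced too,
        # but is still removed once the new mirror is in sync.
        actions.append("resync recovered component")
    actions.append("remove transient component after new mirror syncs")
    return actions
```

So a component that returns within the hour only triggers a resync, while one that returns after the hour is resynced and then discarded anyway.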

So what about the proposed improved mechanism? This is how I described it in the first version of the post. After the default 60 minute timeout, a new component is created and it starts resyncing. If the original component comes back, it remains in an absent state but it starts to resync as well. VSAN internally treats this as a resyncing component, similar to the new component that is already resyncing. Then whichever of the two components finishes resyncing first is kept, whilst the other component is cleaned up.
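The proposed mechanism is essentially a race between the two resyncing components. A minimal sketch, assuming we could estimate each component's remaining resync time (the function name and the idea of comparing estimates are my own illustration, not how VSAN actually decides):

```python
# Hypothetical sketch of the proposed race: both the recovered component
# and the newly created one resync; whichever finishes first is kept.
def pick_survivor(resync_eta):
    """Keep the component with the shortest remaining resync time.

    resync_eta: dict mapping component name -> estimated minutes to sync.
    Returns (survivor, list of components to discard).
    """
    survivor = min(resync_eta, key=resync_eta.get)
    discarded = [c for c in resync_eta if c != survivor]
    return survivor, discarded
```

For example, if the new mirror c3 needs 15 more minutes but the recovered c1 needs 40, c3 is kept and c1 is discarded.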

Let’s look at an example. Here we have a 4 node VSAN cluster, with multiple VMs deployed. Let’s look at one VM in particular, which has two components (c1 and c2) and a witness:

[image: resync1]
Now there is a host failure, and it is the host where component c1 resides:

[image: resync2]
Let’s assume that the host has been shut down. The component now enters an absent state. If 60 minutes pass by and the component has not come back online, VSAN begins rebuilding the missing component elsewhere in the cluster (in other words, it builds another copy of the data from c2). In this example I have tagged it as c3:

[image: resync3]
So far, so good. Now let’s say that whatever issue affected the first host has been resolved. This means that the host rejoins the cluster and the original component c1 is visible again. In this case, VSAN will begin to resync c1 from c2, and will also continue to build c3.

[image: resync4]
Whichever of these components synchronizes first is the one that is kept. The other component is then discarded. In this example, I have assumed that c3 finishes building sooner than c1 synchronizes:

[image: resync5]
Now the VM’s object is compliant once again.

However, a side effect of both components resyncing is that it creates additional network traffic. Ideally, VSAN would determine which one will resynchronize first and only keep that one. We are also looking at optimizing some of this behavior going forward.

3 comments
  1. I’m glad to hear that the mechanism for failed component recovery will be improved! I had this scenario occur for a 20TB VM and the resync took 10 days in addition to eating an extra 20TB and putting our VSAN Datastore at 95%+ capacity during the 10 days. With this new proposed mechanism, it probably would have resolved in under an hour instead of 10 days.

    I think I have also seen similar behavior when setting VMDKs from FTT=0 to FTT=1 and then cancelling (setting it back to FTT=0 before it completes the FTT=1 copy). The FTT=1 copy is still created and synced and then deleted. For very large VMDKs, this also currently ends up as a week+ resync and loss of space during the long resync time when I think it could be resolved shortly.

  2. Thanks for this article! One question: how does this work on a 3-node system? Does it just wait for the 3rd node to come back? We have a 3-node vSAN system, and I noticed that if one node is down, some guests keep running without trouble while others look like they are still running but become inaccessible via system or SSH console. This is especially true for Debian 5/6 and CentOS 5/6/7 guests.

    • If FTT=1 (which is the default), then yes, you would need to wait for the failed node to return. This is why we strongly recommend a fourth node if at all possible. This allows VSAN to self-heal if there is a failure.
