Erasure Coding and Quorum on vSAN

I was looking at the layout of a RAID-5 object configuration on vSAN the other day, and something caught my eye. It wasn't the fact that the object had 4 components; that is exactly what one would expect, since vSAN implements RAID-5 as a 3+1 configuration, i.e. 3 data segments and 1 parity segment. No, what caught my eye was that one of the components had a different vote count. Now, RAID-5 and RAID-6 erasure coding configurations are not the same as RAID-1. With RAID-1, we deploy multiple copies of the data, depending on how many failures we wish to tolerate, along with one or more witness components (in the majority of cases; there are some exceptions). With erasure coding, we do not have, or need, a witness component. Could this explain why we have an extra vote on one component? I had a few conversations with our vSAN engineering team, and this is what I learnt.
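
As a quick aside, here is a minimal Python sketch of how that 3+1 scheme works, and why any 3 of the 4 segments are enough to rebuild the missing one. This is purely illustrative: it is not vSAN's on-disk layout, and the block contents and helper name are made up for the example.

# Illustrative 3+1 XOR parity; not vSAN's actual on-disk format.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Three data segments and one parity segment (parity = d0 XOR d1 XOR d2).
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(xor_blocks(d0, d1), d2)

# Lose any one segment, e.g. d1, and it can be rebuilt from the other three.
rebuilt_d1 = xor_blocks(xor_blocks(d0, d2), parity)
assert rebuilt_d1 == d1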

Now, let's take a look at what I am talking about. Here is the output in question, using RVC (the Ruby vSphere Console).

Disk backing: [vsanDatastore] 09fdef58-84a9-38f0-bd2a-246e962f4910/cor-temp-r5-vm.vmdk
    DOM Object: 0cfdef58-7cc7-bcdf-ec12-246e962f4910 (v5, owner: esxi-dell-e.rainpole.com, 
    proxy owner: None, policy: spbmProfileName = PFTT=0,SFTT=R5,Affinity, affinityMandatory = 1, 
    CSN = 2, affinity = ["a054ccb4-ff68-4c73-cbc2-d272d45e32df"], locality = Preferred, 
    spbmProfileId = 7fd88ad0-45c4-4860-93bf-0e7debbaed4f, replicaPreference = Capacity, 
    spbmProfileGenerationNumber = 0, hostFailuresToTolerate = 0, subFailuresToTolerate = 1)
      RAID_5 
        Component: 0cfdef58-230f-87e0-c990-246e962f4910 (state: ACTIVE (5), host: esxi-dell-g.rainpole.com, 
                   md: naa.500a07510f86d69d, ssd: naa.55cd2e404c31f5fc,
                   votes: 2, usage: 0.0 GB, proxy component: false)
        Component: 0cfdef58-f621-89e0-24b4-246e962f4910 (state: ACTIVE (5), host: esxi-dell-e.rainpole.com, 
                   md: naa.500a07510f86d6bb, ssd: naa.5001e820026415f0,
                   votes: 1, usage: 0.0 GB, proxy component: false)
        Component: 0cfdef58-5c60-8ae0-54ed-246e962f4910 (state: ACTIVE (5), host: esxi-dell-f.rainpole.com, 
                   md: naa.500a07510f86d6b3, ssd: naa.5001e82002664b00,
                   votes: 1, usage: 0.0 GB, proxy component: false)
        Component: 0cfdef58-11ac-8be0-f34a-246e962f4910 (state: ACTIVE (5), host: esxi-dell-h.rainpole.com, 
                   md: naa.500a07510f86d6bc, ssd: naa.55cd2e404c31f898,
                   votes: 1, usage: 0.0 GB, proxy component: false)

So how does that single component in the RAID-5 object having 2 votes help us with object availability? The short answer is that it does not. If this is a 4-node cluster, and the cluster partitions 50/50, with 2 nodes in one partition and 2 nodes in the other, this RAID-5 object would be inaccessible. The voting cannot help here: the partition holding the 3 votes would have quorum, but it still could not make the VM accessible, because it only sees 2 of the 4 components, and a 3+1 RAID-5 object needs at least 3 components present to reconstruct the data. Having quorum is a prerequisite for making an object available, but it is not sufficient. The assignment of two votes to one component is simply down to vSAN always trying to make the total number of votes odd.
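
To make that "necessary but not sufficient" point concrete, here is a rough Python model of the two checks in play for the object above. The vote counts and hosts are taken from the RVC output (hostnames abbreviated); the functions themselves are just an illustration of the logic, not vSAN code.

# Rough model of the RAID-5 object above; vote counts from the RVC output,
# logic purely illustrative.
components = {
    "c1": ("esxi-dell-g", 2),
    "c2": ("esxi-dell-e", 1),
    "c3": ("esxi-dell-f", 1),
    "c4": ("esxi-dell-h", 1),
}
total_votes = sum(v for _, v in components.values())      # 5 votes in total

def has_quorum(reachable_hosts):
    votes = sum(v for h, v in components.values() if h in reachable_hosts)
    return votes > total_votes / 2

def raid5_data_available(reachable_hosts):
    # A 3+1 RAID-5 object needs at least 3 of its 4 components present.
    return sum(1 for h, _ in components.values() if h in reachable_hosts) >= 3

# A 2/2 partition: g and e on one side, f and h on the other.
side_a = {"esxi-dell-g", "esxi-dell-e"}
print(has_quorum(side_a))             # True  - this side holds 3 of the 5 votes
print(raid5_data_available(side_a))   # False - it only sees 2 of the 4 components

Quorum goes to one side, but neither side can serve the data, so the object stays inaccessible in both partitions.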

I decided to take this a little bit further and look at the vote count when I deploy a VM on a vSAN stretched cluster. I set a policy with the primary level of failures to tolerate (PFTT) set to 1 using RAID-1, and the secondary level of failures to tolerate (SFTT) set to 1 using RAID-5. This is a new feature of vSAN 6.6, which allows us not only to tolerate a complete site failure, but also to tolerate further failures within an individual site.

Here is the configuration:

Disk backing: [vsanDatastore] c129e258-2378-637b-0ad7-246e962f4910/winA-01.vmdk
    DOM Object: c329e258-fa6d-d0af-685c-246e962f4910 (v5, owner: esxi-dell-n.rainpole.com, 
      proxy owner: esxi-dell-g.rainpole.com, policy: spbmProfileId = 667dbd61-1742-46a3-a88a-1eb2b59cc7e9, 
      forceProvisioning = 0, hostFailuresToTolerate = 1, spbmProfileName = Stretch-Cluster-R5, 
      cacheReservation = 0, subFailuresToTolerate = 1, CSN = 110, stripeWidth = 1, proportionalCapacity = 0, 
      SCSN = 128, replicaPreference = Capacity, spbmProfileGenerationNumber = 3)
      RAID_1
        RAID_5
          Component: 9805ea58-f620-2199-2264-246e962f48c0 (state: ACTIVE (5), host: esxi-dell-m.rainpole.com, 
                     md: naa.500a075113019b33, ssd: t10.ATA_____Micron_P420m2DMTFDGAR1T4MAX______________0000000014170C1BAA51,
                     votes: 2, usage: 3.0 GB, proxy component: false)
          Component: 9805ea58-9fc2-2499-c051-246e962f48c0 (state: ACTIVE (5), host: esxi-dell-p.rainpole.com, 
                     md: naa.500a075113019919, ssd: t10.ATA_____Micron_P420m2DMTFDGAR1T4MAX______________0000000014170C1BA973,
                     votes: 1, usage: 3.0 GB, proxy component: false)
          Component: 9805ea58-bd6d-2799-a84d-246e962f48c0 (state: ACTIVE (5), host: esxi-dell-o.rainpole.com, 
                     md: naa.500a075113019b40, ssd: naa.500a075113019933,
                     votes: 1, usage: 3.0 GB, proxy component: false)
          Component: 9805ea58-906c-2999-9a52-246e962f48c0 (state: ACTIVE (5), host: esxi-dell-n.rainpole.com, 
                     md: naa.500a07511301a1e8, ssd: naa.500a07511301a1e7,
                     votes: 1, usage: 3.0 GB, proxy component: false)
        RAID_5
          Component: 9805ea58-67c9-2b99-59cd-246e962f48c0 (state: ACTIVE (5), host: esxi-dell-g.rainpole.com, 
                     md: naa.500a07510f86d6c6, ssd: naa.5001e82002675164,
                     votes: 1, usage: 3.0 GB, proxy component: true)
          Component: 9805ea58-edf0-2d99-430e-246e962f48c0 (state: ACTIVE (5), host: esxi-dell-e.rainpole.com, 
                     md: naa.500a07510f86d685, ssd: naa.55cd2e404c31f86d,
                     votes: 1, usage: 3.0 GB, proxy component: true)
          Component: 9805ea58-2aa6-2f99-f4e8-246e962f48c0 (state: ACTIVE (5), host: esxi-dell-h.rainpole.com, 
                     md: naa.500a07510f86d6bc, ssd: naa.55cd2e404c31f898,
                     votes: 1, usage: 3.0 GB, proxy component: true)
          Component: 9805ea58-4b3b-3299-57c3-246e962f48c0 (state: ACTIVE (5), host: esxi-dell-f.rainpole.com, 
                     md: naa.500a07510f86d6b3, ssd: naa.5001e82002664b00,
                     votes: 1, usage: 3.0 GB, proxy component: true)
      Witness: eb09ea58-1535-69bc-df69-246e962f48c0 (state: ACTIVE (5), host: witness-01.rainpole.com, 
               md: mpx.vmhba1:C0:T1:L0, ssd: mpx.vmhba1:C0:T2:L0,
               votes: 4, usage: 0.0 GB, proxy component: false)

So we can see once more that on site 1 there is a single component with a second vote, while on site 2 every component has a single vote. Again, this extra vote doesn't play a role; it is just vSAN making the total number of votes odd once more (note that the witness component carries 4 votes, bringing the total for the object to 13, again an odd number). In a stretched cluster there is no 50/50 partition scenario, because we have three fault domains, so the decision is always based on which partition holds the majority of fault domains. In a vSAN stretched cluster, if the two data sites are partitioned from one another, the witness always joins the "preferred" site, giving that partition the majority of fault domains.
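
Just to spell out the arithmetic, here is a quick tally of the votes per fault domain from the output above. The fault domain labels are my own, and this is only an illustration of the counting, not how vSAN evaluates partitions internally.

# Votes per fault domain, tallied from the RVC output above (illustrative only).
fault_domains = {
    "site-1":  [2, 1, 1, 1],   # first RAID-5 leg
    "site-2":  [1, 1, 1, 1],   # second RAID-5 leg (the proxy components)
    "witness": [4],            # the witness component
}
votes = {fd: sum(v) for fd, v in fault_domains.items()}
total = sum(votes.values())
print(votes, "total:", total)   # {'site-1': 5, 'site-2': 4, 'witness': 4} total: 13

# If the two data sites lose contact with each other, the witness joins the
# preferred site. Either data site plus the witness holds a clear majority
# (5 + 4 = 9 or 4 + 4 = 8 out of 13) and contains a full RAID-5 leg, so the
# object remains accessible in that partition.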

So, to recap, vSAN always deploys objects with an odd number of votes, even though the extra vote is not really needed in many scenarios, such as the RAID-5 example shown above. However, it does not impact vSAN's ability to make quorum decisions, and the extra votes are not an issue in any way.