vSAN Erasure Coding Failure Handling

I had a very interesting question recently about how vSAN handles a failure in an object that is running with an erasure coding configuration. In the case of vSAN, this is either RAID-5 or RAID-6. On vSAN, RAID-5 is implemented with 3 data segments and 1 parity segment (3+1), with parity striped across all four components. RAID-6 is implemented as 4 data segments and 2 parity segments (4+2), again with the parity striped across all six components. So what happens when we need to continue writing to one of these objects after a component/segment has failed?

After discussing this with one of our vSAN engineering leads, the answer is that it depends on which offset you are writing to. Let’s take RAID-5 as an example. The RAID-5 VMDK object address space is split into 1 MB stripes. If we take 3 of the 4 RAID-5 components together, their stripes make up one contiguous 3 MB range. We refer to this as a row, which is distributed over the three components. The fourth component is used for parity. The component used for the parity gets rotated for each row.

Let’s take a look at a simplified RAID-5 VMDK object address space, where we are examining the first 15 MB. Data is written at 1 MB per component.

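          Comp1     Comp2     Comp3     Comp4
  Row 1   0-1 MB    1-2 MB    2-3 MB    Parity
  Row 2   3-4 MB    4-5 MB    Parity    5-6 MB

  (The remaining rows continue in the same way, with the parity component rotating on each row.)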
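
To make the row layout easier to reason about, here is a minimal Python sketch that maps an offset in the object address space to a row and a component. The 1 MB stripe size and the 3+1 arrangement come from the description above. The rotation rule is an assumption: it is chosen so that parity lands on Comp4 for row 1 and Comp3 for row 2, as in this post, but the placement in later rows may differ in practice. The function names are hypothetical and are not part of any vSAN API.

```python
MB = 1024 * 1024
STRIPE = 1 * MB               # 1 MB per component per row (from the post)
COMPONENTS = 4                # RAID-5 is 3 data + 1 parity
DATA_PER_ROW = COMPONENTS - 1


def parity_component(row):
    """0-based component index holding parity for a 0-based row.

    Assumption: parity starts on the last component and moves one
    component to the left on each row, wrapping around.
    """
    return (COMPONENTS - 1 - row) % COMPONENTS


def data_components(row):
    """Components holding data for this row, in address order."""
    parity = parity_component(row)
    return [c for c in range(COMPONENTS) if c != parity]


def locate(offset):
    """Map a byte offset in the object to (row, component), both 0-based."""
    row, within_row = divmod(offset, DATA_PER_ROW * STRIPE)
    slot = within_row // STRIPE   # which of the 3 data segments in the row
    return row, data_components(row)[slot]


def layout(rows):
    """Build a table of cell labels, one list per row."""
    table = []
    for row in range(rows):
        cells = ["Parity"] * COMPONENTS
        base = row * DATA_PER_ROW
        for slot, comp in enumerate(data_components(row)):
            cells[comp] = f"{base + slot}-{base + slot + 1} MB"
        table.append(cells)
    return table


if __name__ == "__main__":
    print("  ".join(f"Comp{c + 1:<6}" for c in range(COMPONENTS)))
    for cells in layout(5):
        print("  ".join(f"{cell:<10}" for cell in cells))
```

For example, locate(2 * MB) returns (0, 2), meaning the 2-3 MB range sits in the first row on Comp3, which is the segment affected in the failure scenario discussed next.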

Let’s now assume that Comp3 has failed. What happens to the writes now?

First, we will look at a row that has lost a data component. Let’s take the first row. Future writes to the 0-2 MB range in the object address space will be unaffected. They will still go to their respective data component (either Comp1 or Comp2). Writes to the 2-3 MB range will read data from Comp1 and Comp2, calculate the new parity based on all 3 data segments (the two that were read plus the data being written), and then write that parity to Comp4. But of course there cannot be a write to Comp3, as it is now failed/missing. This same procedure applies to all other rows that are missing data due to the failure of Comp3.
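
The write path for a row that has lost a data component can be sketched in a few lines of Python. This is only an illustration of the idea, assuming plain XOR parity (RAID-5 parity is XOR based) and tiny 8-byte segments standing in for the 1 MB stripes. The function names are hypothetical and do not reflect how vSAN implements this internally.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def degraded_write_parity(surviving_data, new_data):
    """Parity for a row whose target data component has failed.

    surviving_data: segments read from the healthy data components.
    new_data: the guest write that cannot be stored on the failed component.
    The new data is folded into the parity even though it is never written.
    """
    return xor_blocks(*surviving_data, new_data)


# Row 1: data on Comp1, Comp2 and Comp3, parity on Comp4. Comp3 has failed.
comp1 = b"\x11" * 8
comp2 = b"\x22" * 8
new_data_for_comp3 = b"\x5a" * 8      # guest write to the 2-3 MB range

# Read Comp1 and Comp2, include the new data, and write the result to Comp4.
comp4_parity = degraded_write_parity([comp1, comp2], new_data_for_comp3)
```

The point of the sketch is that the only I/O to healthy components is two reads (Comp1, Comp2) and one write (Comp4); the new data itself lives on only inside the parity.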

Let’s now look at a row that has lost its parity component, for example, row 2. Writes to the 3-6 MB range will just write the data to Comp1, Comp2 and Comp4 as normal, with no parity. Hence there are no parity reads associated with this write operation. In this case there is a reduction in the amount of IO amplification involved. For RAID-5 writes, we would typically have to read the existing data and parity, calculate the new parity, and then write back both the new data and the new parity. Now, with rows that have parity on the failed component, the reads and writes will not be amplified. In fact, as we have seen, reads and writes are each decreased from 2 to 1 in cases where the parity resides on the failed component.
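
For comparison, this is what the normal (healthy) RAID-5 small write looks like, as a minimal Python sketch. It assumes plain XOR parity and uses 4-byte segments in place of 1 MB stripes; rmw_parity is a hypothetical helper name. The identity it relies on is that the new parity equals the old parity XOR the old data XOR the new data, so only the segment being overwritten and the parity need to be read.

```python
def xor(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def rmw_parity(old_parity, old_data, new_data):
    """Read-modify-write parity update for a single-segment write."""
    return xor(xor(old_parity, old_data), new_data)


# A healthy row with three data segments and their parity.
d0, d1, d2 = b"\x01" * 4, b"\x02" * 4, b"\x04" * 4
old_parity = xor(xor(d0, d1), d2)

# Overwrite the middle segment: 2 reads (old d1, old parity),
# then 2 writes (new d1, new parity).
new_d1 = b"\xf0" * 4
new_parity = rmw_parity(old_parity, d1, new_d1)

# The shortcut gives the same parity as recomputing from all three segments.
full_recompute = xor(xor(d0, new_d1), d2)
assert new_parity == full_recompute
```

When the parity component is the one that has failed, the parity read and parity write in this sequence simply have nowhere to go, which is where the reduction in amplification comes from.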

So, to recap, we still maintain a 3+1 RAID-5 arrangement for data placement, but there is a “functional repair” whereby we include the data that cannot be written in the parity calculation. We can then use that parity (with the other two data components Comp1 and Comp2) to reconstruct the original data if we need to service a guest read, or of course to resync to Comp3 when it recovers.
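
The reconstruction side of this recap can be sketched as follows, again assuming plain XOR parity and short byte strings in place of 1 MB segments. The helper name xor_segments is hypothetical. Because XOR is its own inverse, the same operation that produces the parity also recovers a missing segment from the survivors.

```python
from functools import reduce
from operator import xor


def xor_segments(segments):
    """XOR a list of equally sized segments together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*segments))


comp1 = b"vSAN"
comp2 = b"RAID"
unwritten = b"data"            # destined for the failed Comp3, never stored

# At write time the parity on Comp4 covers all three data segments.
comp4_parity = xor_segments([comp1, comp2, unwritten])

# A later guest read, or a resync to Comp3 once it recovers, rebuilds the
# missing segment from the two surviving data components and the parity.
rebuilt = xor_segments([comp1, comp2, comp4_parity])
assert rebuilt == unwritten
```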

  1. Interesting. Does the 1MB size apply only to RAID 5 or also to RAID6, and stripes in general ? Is there any doc that shows, e.g, how is data for a RAID-0 with 4 stripes distributed in components ? Thanks!

  2. Can you please explain what would happen if one complete node fails with erasure coding RAID-5 (3+1) in a 4-node cluster? Will my application continue working, and how does the data rebuild happen?
    Please share any documentation you have on this.

    • Hi Anil,
      Yes, if one node fails, your application will continue to work.
      Erasure Coding RAID-5 uses XOR operations to create the parity. This parity is then used with the existing data to recreate any missing data when the issue is resolved and all 4 nodes are participating in the vSAN cluster.
