vSAN Erasure Coding Failure Handling

I had a very interesting question recently about how vSAN handles a failure in an object that is running with an erasure coding configuration. In the case of vSAN, this is either a RAID-5 or a RAID-6. On vSAN, a RAID-5 is implemented with 3 data segments and 1 parity segment (3+1), with the parity striped across all four components. RAID-6 is implemented as 4 data segments and 2 parity segments (4+2), again with the parity striped across all six components. So what happens when we need to continue writing to one of these objects after a component/segment has failed?

After discussing this with one of our vSAN engineering leads, the answer is that it depends on which offset you are writing to. Let’s take RAID-5 as an example. The RAID-5 VMDK object address space is split into 1 MB stripes. If we take 3 of the 4 RAID-5 components together, this makes up one contiguous 3 MB range. We refer to this as a row, which is distributed over the three components. The fourth component is used for parity. The component used for the parity is rotated for each row.

Let’s take a look at a simplified RAID-5 VMDK object address space, where we are examining the first 12 MB. Data is written in 1 MB segments per component.

  Comp1   Comp2   Comp3   Comp4
|-0--1-||-1--2-||-2--3-||PARITY|
|-3--4-||-4--5-||PARITY||-5--6-|
|-6--7-||PARITY||-7--8-||-8--9-|
|PARITY||-9-10-||-10-11||-11-12|
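
To make the rotation concrete, here is a minimal Python sketch (purely an illustration of the layout in the diagram, not vSAN code; the function name and 0-based component numbering are my own) that maps a 1 MB-aligned offset in the object address space to its row, its data component, and the component holding that row's parity.

DATA_COMPONENTS = 3    # RAID-5 is 3+1
TOTAL_COMPONENTS = 4

def locate(offset_mb: int):
    """Map a 1 MB-aligned offset to (row, data component, parity component),
    following the rotation in the diagram above. Numbering is illustrative:
    component 0 is Comp1, component 3 is Comp4."""
    row = offset_mb // DATA_COMPONENTS
    parity_comp = (TOTAL_COMPONENTS - 1) - (row % TOTAL_COMPONENTS)
    data_comps = [c for c in range(TOTAL_COMPONENTS) if c != parity_comp]
    data_comp = data_comps[offset_mb % DATA_COMPONENTS]
    return row, data_comp, parity_comp

print(locate(2))   # (0, 2, 3): the 2-3 MB range sits on Comp3, parity on Comp4
print(locate(5))   # (1, 3, 2): the 5-6 MB range sits on Comp4, parity on Comp3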

Let’s now assume that Comp3 has failed. What happens to the writes now?

First, we will look at a row that has lost a data component. Let’s take the first row. Future writes to the 0-2 MB range in the object address space are unaffected; they still go to their respective data components (Comp1 or Comp2). Writes to the 2-3 MB range will read the data from Comp1 and Comp2, calculate a new parity based on all three data segments (including the new data that should have landed on Comp3), and then write that parity to Comp4. Of course, there cannot be a write to Comp3 itself, as it is now failed/missing. The same procedure applies to every other row that is missing a data segment due to the failure of Comp3.
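
As a rough sketch of that parity update (assuming simple XOR parity, which is my assumption for illustration rather than anything confirmed above), the new data destined for the failed component is folded into the parity along with the segments read from the surviving data components, and only the parity is written out:

def degraded_write_data_segment(new_data: bytes, surviving_segments: list) -> bytes:
    """Sketch of a write to a row whose target data component has failed:
    XOR the new data that cannot be placed with the surviving data segments
    (e.g. the 1 MB segments read from Comp1 and Comp2) and return the new
    parity to write to the parity component (Comp4 in row 1 of the example).
    XOR parity is assumed here for illustration."""
    parity = bytearray(new_data)
    for segment in surviving_segments:
        for i, byte in enumerate(segment):
            parity[i] ^= byte
    return bytes(parity)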

Let’s now look at a row that has lost its parity component, for example, row 2. Writes to the 3-6 MB range simply write the data to Comp1, Comp2 and Comp4 as normal, with no parity. Hence there are no parity reads associated with this write operation, which reduces the amount of IO amplification involved. For RAID-5 writes, we would typically have to read the existing data and parity, write back the new data, calculate the new parity and write it back. Now, for rows whose parity sits on the failed component, the reads and writes are no longer amplified: the writes drop from 2 to 1 and the 2 reads are avoided altogether.
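
Putting the two degraded cases alongside the normal read-modify-write, the per-write IO might be summarised as in the sketch below; the counts simply restate the text and are illustrative, not measured:

def degraded_write_io(write_comp: int, parity_comp: int, failed_comp: int) -> dict:
    """Rough IO counts for a 1 MB write after a single component failure.
    write_comp and parity_comp are the data and parity components for the
    row being written (e.g. from the locate() sketch earlier)."""
    if failed_comp == parity_comp:
        # Row lost its parity: write the data as normal and skip parity entirely.
        return {"reads": 0, "writes": 1}
    if failed_comp == write_comp:
        # Row lost the segment being written: read the two surviving data
        # segments and write only the recalculated parity.
        return {"reads": 2, "writes": 1}
    # The failure does not touch this write: the usual RAID-5 read-modify-write
    # (read old data and old parity, write new data and new parity).
    return {"reads": 2, "writes": 2}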

So, to recap, we still maintain a 3+1 RAID-5 arrangement for data placement, but there is a “functional repair” whereby the data that cannot be written is instead included in the parity calculation. We can then use that parity (along with the other two data components, Comp1 and Comp2) to reconstruct the original data if we need to service a guest read, or of course to resync to Comp3 when it recovers.
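
For completeness, the read side of that functional repair can be sketched the same way (again assuming XOR parity): XOR the parity with the surviving data segments to recover the data that was never written to the failed component.

def reconstruct_missing_segment(parity: bytes, surviving_segments: list) -> bytes:
    """Sketch of servicing a guest read (or a resync back to Comp3) for a
    segment whose component has failed: XOR the parity with the surviving
    data segments to rebuild the missing data. XOR parity is assumed."""
    out = bytearray(parity)
    for segment in surviving_segments:
        for i, byte in enumerate(segment):
            out[i] ^= byte
    return bytes(out)

# Round trip with the earlier sketch:
#   parity = degraded_write_data_segment(new_data, [seg1, seg2])
#   reconstruct_missing_segment(parity, [seg1, seg2]) == new_data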
