vSAN Erasure Coding Failure Handling

Cormac

5 years ago

I had a very interesting question recently about how vSAN handles a failure in an object that is running with an erasure coding configuration. In the case of vSAN, this is either RAID-5 or RAID-6. On vSAN, RAID-5 is implemented with 3 data segments and 1 parity segment (3+1), with parity striped across all four components. RAID-6 is implemented as 4 data segments and 2 parity segments (4+2), again with the parity striped across all six components. RAID-5 on vSAN requires 4 physical ESXi hosts, with each host backing one set of components. Similarly, RAID-6 requires 6 ESXi hosts for the same reason. So what happens when we need to continue writing to one of these objects after a component/segment/disk/host has failed?

I discussed this with one of our vSAN engineering leads, and the answer is that it depends on which offset you are writing to. Let’s take RAID-5 as an example. The RAID-5 VMDK object address space is split into 1 MB stripes. Taken together, the 3 of the 4 RAID-5 components that hold data for a given set of stripes make up one contiguous 3 MB range. We refer to this as a row, which is distributed over the three components. The fourth component is used for parity. The component used for parity is rotated for each row.

Let’s take a look at a simplified RAID-5 VMDK object address space, where we are examining the first 12 MB. Data is written in 1 MB chunks per component.

  Comp1   Comp2   Comp3   Comp4
|-0--1-||-1--2-||-2--3-||PARITY|
|-3--4-||-4--5-||PARITY||-5--6-|
|-6--7-||PARITY||-7--8-||-8--9-|
|PARITY||-9-10-||-10-11||-11-12|
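
To make the row layout concrete, here is a minimal Python sketch that maps an offset in the object address space to a row, a data component and a parity component. It assumes the 1 MB chunk size and 3+1 layout described above, and it assumes parity rotates from Comp4 down to Comp1 as in the simplified diagram. The function name `locate` and the constants are hypothetical, and the real vSAN placement logic may differ.

```python
MB = 1024 * 1024
CHUNK = 1 * MB                  # per-component chunk size, taken from the text
COMPONENTS = 4                  # RAID-5 on vSAN: 3 data + 1 parity
DATA_PER_ROW = COMPONENTS - 1   # each row holds 3 MB of data


def locate(offset):
    """Return (row, data_component, parity_component) for a byte offset.

    Components are numbered 1..4 to match Comp1..Comp4.
    The rotation order is an assumption based on the simplified diagram.
    """
    chunk = offset // CHUNK
    row = chunk // DATA_PER_ROW
    slot = chunk % DATA_PER_ROW
    # Assumed rotation: row 0 -> Comp4, row 1 -> Comp3, row 2 -> Comp2, ...
    parity = COMPONENTS - (row % COMPONENTS)
    data_components = [c for c in range(1, COMPONENTS + 1) if c != parity]
    return row, data_components[slot], parity


print(locate(2 * MB))   # (0, 3, 4): the 2-3 MB chunk is on Comp3, parity on Comp4
print(locate(5 * MB))   # (1, 4, 3): the 5-6 MB chunk is on Comp4, parity on Comp3
```

With a mapping like this, deciding how a write behaves after a failure comes down to comparing the failed component against the data and parity components returned for that offset.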

Let’s now assume that Comp3 (component 3) has failed. In other words, the storage device or host backing all of the Comp3 data and parity has failed. What happens to the writes now?

First, we will look at a row that has a data component impacted when Comp3 failed. Let’s take the first row. Future writes to the 0-2 MB range in the object address space will be unaffected. They will still go to their respective data component (either Comp1 or Comp2). Writes to the 2-3 MB range will read data from Comp1 and Comp2, calculate the new parity based on all 3 data chunks (the two that were just read plus the new data destined for Comp3), and then write the parity to Comp4. But of course there cannot be a write to Comp3, as it is now failed/missing. This same procedure applies to all other rows that are missing a data component due to the failure of Comp3.
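
The parity calculation for a write whose data component is missing can be sketched as follows. This is an illustration only: it assumes plain XOR parity, as in conventional RAID-5, and the helper names `xor_blocks` and `parity_for_degraded_write` are hypothetical, not vSAN code.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def parity_for_degraded_write(new_data, surviving_data):
    """Parity for a row whose target data component is absent.

    new_data is the guest write that cannot land on the failed component;
    surviving_data holds the chunks read from the healthy data components.
    """
    return xor_blocks(new_data, *surviving_data)


# Row 1 with Comp3 failed: read Comp1 and Comp2, fold in the new data,
# and write only the resulting parity to Comp4.
comp1 = bytes([0x01, 0x02])
comp2 = bytes([0x10, 0x20])
new_for_comp3 = bytes([0xFF, 0x00])

parity_comp4 = parity_for_degraded_write(new_for_comp3, [comp1, comp2])
print(parity_comp4.hex())   # ee22
```

The new data is never stored on its own, but because it is folded into the parity, XOR-ing the parity with the two surviving data chunks yields it again.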

Let’s now look at a row that has lost its parity component when Comp3 failed, for example, row 2. Writes to the 3-6 MB range will just write the data to Comp1, Comp2 and Comp4 as normal, with no parity. Hence there are no parity reads associated with this write operation. In this case there is a reduction in the amount of IO amplification involved. For RAID-5 writes, we would typically have to read the existing data and parity, write back the new data, calculate the new parity and write it back. Now, for rows that have their parity on the failed component, the reads and writes will not be amplified. In fact, as we have seen, reads and writes are decreased from 2 to 1 in cases where the parity is on the affected component.

So, to recap, we still maintain a 3+1 RAID-5 arrangement for data placement, but there is a “functional repair” whereby we include the data that cannot be written in the parity calculation. We can then use that parity (with the other two data components Comp1 and Comp2) to reconstruct the original data if we need to service a guest read, or of course to resync to Comp3 when it recovers.
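
As a rough illustration of the read and resync path in the recap, the sketch below rebuilds a missing chunk from the surviving chunks of a row. It again assumes simple XOR parity. The row is modelled as a plain dictionary, and `xor_all`, `read_chunk` and the sample byte values are hypothetical names and data chosen for the example.

```python
from functools import reduce


def xor_all(blocks):
    """Byte-wise XOR of a list of equally sized blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)


def read_chunk(row, comp):
    """Return the chunk for comp, rebuilding it if the component is absent.

    row maps component name -> bytes (data or parity) for a single row.
    With XOR parity, any one missing chunk equals the XOR of the others.
    """
    if comp in row:
        return row[comp]
    return xor_all(list(row.values()))


# Healthy row 1: three data chunks, parity on Comp4.
row1 = {"Comp1": b"\x01\x02", "Comp2": b"\x10\x20", "Comp3": b"\xff\x00"}
row1["Comp4"] = xor_all(list(row1.values()))

# Comp3 fails: its chunk is no longer directly readable.
degraded = {name: blk for name, blk in row1.items() if name != "Comp3"}

# A guest read of the 2-3 MB range is served by reconstruction.
guest_read = read_chunk(degraded, "Comp3")

# When Comp3 recovers, the same reconstruction feeds the resync.
degraded["Comp3"] = read_chunk(degraded, "Comp3")
```

The same `read_chunk` call covers a row that lost its parity instead of data, since XOR-ing the three data chunks regenerates the parity chunk.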