This next part of the VSAN 6.2 series of posts focuses on an important feature which many customers have been requesting. VSAN 6.2 introduces end-to-end software checksum to help customers avoid data integrity issues arising from problems on the underlying storage media. Checksum is enabled by default, as we feel customers will always want to leverage this great new feature, but it may be disabled on a per virtual machine/object basis via VM storage policies. The only reason one might disable it is if the application already includes this functionality.
The new policy capability for checksum is called ‘Disable object checksum’. Selecting it when creating a VM Storage Policy, as shown below, disables checksum for the objects that use that policy. Otherwise checksum is always enabled.
Checksum on VSAN is implemented using the very common cyclic redundancy check CRC-32C (Castagnoli) for best performance, utilizing special CPU instructions on Intel processors. Every 4KB block will have a checksum associated with it. The checksum is 5 bytes in size. When the data is written, the checksum is verified on the same host where the data originates to ensure that if there is any corruption in-flight over the network, it is caught. The checksum is persisted with the data.
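As a rough illustration of per-block checksumming, the Python sketch below computes a CRC-32C value for every 4KB block of a buffer. It is a table-driven software implementation rather than the hardware-accelerated path, and the names crc32c, block_checksums and BLOCK_SIZE are hypothetical. The initial value and final XOR follow the common CRC-32C convention, which is an assumption here and may not match what VSAN persists on disk.

```python
CRC32C_POLY = 0x82F63B78  # Castagnoli polynomial, bit-reflected form
BLOCK_SIZE = 4096         # one checksum per 4KB block, as described above


def _build_table() -> list[int]:
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC32C_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes) -> int:
    """Return the CRC-32C of data (assumed convention: init and final XOR of 0xFFFFFFFF)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def block_checksums(data: bytes) -> list[int]:
    """Return one CRC-32C value per BLOCK_SIZE chunk of data."""
    return [crc32c(data[i:i + BLOCK_SIZE]) for i in range(0, len(data), BLOCK_SIZE)]


# Two 4KB blocks; flipping a single byte in the second block changes only its checksum.
original = bytes(2 * BLOCK_SIZE)
damaged = original[:BLOCK_SIZE] + b"\x01" + original[BLOCK_SIZE + 1:]
before = block_checksums(original)
after = block_checksums(damaged)
```

Comparing before and after shows the first block's checksum unchanged and the second block's checksum different, which is how a single corrupted block can be pinpointed.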
On a subsequent read of the data, if checksum is enabled, the checksum data is also requested. If the checksum reveals that the data block that was just read is in some way corrupted, then in the case of RAID-1 objects, the correct data is read from the other replica/mirror. In the case of RAID-5/RAID-6 objects, the data block is reconstructed from the other components in the RAID stripe. An error is also logged to the vmkernel.log file on the host that contains the device where the component reported the error, as well as on the host where the VM runs. In the example below, we deliberately overwrote a zeroed-out block of data with a random pattern of data, and then read the data via the Guest OS:
2016-02-16T07:31:44.082Z cpu0:33075)LSOM: RCDomCompletion:6706: \
Throttled: Checksum error detected on component \
a3fbc156-3573-4f2c-f257-0050560217f4 \
(computed CRC 0x6e4179d7 != saved CRC 0x0)

2016-02-16T07:31:44.086Z cpu0:33223)LSOM: LSOMScrubReadComplete:1958: \
Throttled: Checksum error detected on component \
a3fbc156-3573-4f2c-f257-0050560217f4, data offset 524288 \
(computed CRC 0x6e4179d7 != saved CRC 0x0)

2016-02-16T07:31:44.096Z cpu1:82528)WARNING: DOM: \
DOMScrubberAddCompErrorFixedVob:327: Virtual SAN detected and fixed a \
medium or checksum error for component \
a3fbc156-3573-4f2c-f257-0050560217f4 \
on disk group 521f5f1b-c59a-0fe2-bdc0-d1236798437c
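The verify-on-read behaviour described above can be sketched in a few lines of Python. This is only a conceptual model, not VSAN's code: the function names are hypothetical, zlib.crc32 (plain CRC-32) stands in for CRC-32C to keep the example short, and the parity rebuild uses simple single-parity XOR, which corresponds to the RAID-5 idea only.

```python
import zlib


def checksum(block: bytes) -> int:
    # Stand-in only: zlib.crc32 is CRC-32 (IEEE), not the CRC-32C that VSAN uses.
    return zlib.crc32(block)


def read_mirrored(replicas: list[bytes], saved_crc: int) -> tuple[bytes, list[int]]:
    """Return the first replica matching saved_crc, plus the indices of replicas that failed."""
    failed = []
    for index, block in enumerate(replicas):
        if checksum(block) == saved_crc:
            return block, failed
        failed.append(index)
    raise IOError("no replica matches the saved checksum")


def rebuild_from_parity(surviving: list[bytes]) -> bytes:
    """Rebuild one missing block by XOR-ing the surviving data and parity blocks."""
    rebuilt = bytearray(len(surviving[0]))
    for block in surviving:
        for i, byte in enumerate(block):
            rebuilt[i] ^= byte
    return bytes(rebuilt)


# RAID-1 style: the first mirror is corrupted, so the read falls back to the second.
good = bytes(range(256)) * 16            # a 4KB block
saved = checksum(good)                   # checksum persisted at write time
corrupt = b"\xff" + good[1:]             # the same block with its first byte damaged
data, failed = read_mirrored([corrupt, good], saved)

# RAID-5 style: the parity is the XOR of the data blocks, so either one can be rebuilt.
other = bytes([0x5A]) * 4096
parity = rebuild_from_parity([good, other])
recovered = rebuild_from_parity([other, parity])
```

Here data holds the intact block, failed lists the replica that did not match (a real system would log and repair it), and recovered equals the original block rebuilt from the rest of the stripe.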
Alongside the checksum verification on read operations, VSAN also has a scrubber mechanism which checks that the data on disk does not have any silent corruption. The scrubber is designed to check all of the data once a year, but it can be tuned to run more often via the advanced setting VSAN.ObjectScrubsPerYear. For instance, if you want the scrubber to check all of the data once a week, set the value to 52, but be aware that there will be some performance overhead when this operation runs.
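To make the relationship between the setting and the scrub frequency concrete, here is a small Python sketch of the arithmetic. The helper name scrub_interval_days is hypothetical, and the assumption that scrub passes are spread evenly over a 365-day year is mine, not something VSAN documents here.

```python
def scrub_interval_days(scrubs_per_year: int) -> float:
    """Approximate number of days between full scrub passes."""
    if scrubs_per_year < 1:
        raise ValueError("scrubs_per_year must be at least 1")
    return 365 / scrubs_per_year


yearly = scrub_interval_days(1)    # the once-a-year behaviour described above
weekly = scrub_interval_days(52)   # roughly one pass per week
```

A value of 52 works out to a little over seven days between passes, which matches the "once a week" example.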
Checksum is fully supported with all of the new features, such as RAID-5/RAID-6 and deduplication and compression, and with configurations such as VSAN stretched cluster. As mentioned, it is on by default so customers simply get the benefit without having to configure it. And if you find you don’t want it, for some reason or other, simply disable it in your VM Storage Policy as shown above. This feature will enable VSAN customers to detect data corruption caused by “latent sector errors”, which are typically due to physical drive problems, or by other silent data corruption.