As many of you are aware, VMware made a number of announcements at VMworld 2012. There were three technical previews in the storage space. The first of these was on Virtual Volumes (VVols), which is aimed at making storage objects in virtual infrastructures more granular. The second was Virtual SAN (VSAN), previously known as Distributed Storage, a new distributed datastore using local ESXi storage. The final one was Virtual Flash (vFlash). However, rather than diving into vFlash, I thought it might be more useful to take a step back and have a look at flash technologies in general.
Over the past couple of years, we have seen Solid State Drives (SSDs) emerge in the enterprise, either as a cache, as a storage tier, or indeed in all-flash storage arrays. SSDs are based on NAND flash, and they are now appearing in storage infrastructures which, until recently, were dominated by spinning disks (HDDs). Historically, many storage vendors have used DRAM as a cache to provide additional performance over and above what spinning disk can provide. So what are the major differences between DRAM & NAND flash? Well, first off, DRAM is volatile: it needs power to maintain its contents, and so must be backed up with a battery/capacitor so that, in the event of a power failure, the contents can be drained to a persistent storage layer. The advantage of NAND flash is that it is non-volatile. A second major difference is that NAND flash will eventually wear out; there are only so many times you can write & erase it. Significant research & development is given over to extending the lifespan of NAND flash, and I’ll cover this in more detail later. Another difference is that DRAM is significantly faster than NAND flash. And finally, as one might expect, DRAM is far more expensive than NAND flash.
Flash as a cache
When flash is used as a cache, it is usually described in one of two ways: as a write-through cache (read cache) or as a write-back cache (write cache). A write-through cache stores the write in cache, but the write is not acknowledged until it has also been committed to persistent storage. This means that there is no performance gain for the write, but subsequent reads can be retrieved from cache, making those operations much faster. Thus it is referred to as a read cache. A write-back cache, on the other hand, sends an acknowledgement as soon as the write is received in cache. This dramatically speeds up both read and write performance. The writes are destaged periodically to persistent storage. The issue with write-back is that there is a period of time when the writes are still in cache but not on disk, and if there is a failure during that window, you risk loss of data or data corruption.
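To make the difference concrete, here is a minimal sketch of the two policies. It is purely illustrative: the `ToyCache` class, its dictionaries and the `destage` method are hypothetical names invented for this example, and no real caching product works this simply.

```python
class ToyCache:
    """Toy cache in front of a slower persistent store (both are plain dicts)."""

    def __init__(self, write_back=False):
        self.write_back = write_back
        self.cache = {}     # stands in for the flash cache
        self.backing = {}   # stands in for persistent storage (e.g. spinning disk)
        self.dirty = set()  # keys held in cache but not yet on persistent storage

    def write(self, key, value):
        self.cache[key] = value
        if self.write_back:
            # Write back: acknowledge now; the data is at risk until destaged.
            self.dirty.add(key)
        else:
            # Write through: commit to persistent storage before acknowledging.
            self.backing[key] = value
        return "ack"

    def read(self, key):
        if key in self.cache:
            return self.cache[key]   # cache hit, served from flash
        value = self.backing[key]    # cache miss, go to persistent storage
        self.cache[key] = value
        return value

    def destage(self):
        """Flush dirty entries to persistent storage (write back only)."""
        for key in list(self.dirty):
            self.backing[key] = self.cache[key]
        self.dirty.clear()
```

In this sketch, a write to a write-back `ToyCache` leaves the key in `dirty` until `destage()` runs; that set is exactly the data you would lose if the cache failed before the destage.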
NAND Flash Types
In single-level cell (SLC) NAND flash technology, each cell can store a single bit of information. Multi-level cell (MLC) NAND flash uses multiple levels per cell to allow more bits to be stored. An MLC cell can usually have 4 values (00, 01, 10, 11), i.e. 2 bits per cell. MLC is cheaper than SLC, but has a shorter life-span. Because MLC uses the same number of transistors as SLC while having to distinguish more levels in each cell, there is a higher risk of errors. To address this, another type of NAND flash, enterprise MLC (eMLC), has been designed to reduce the risk of errors. Basically, eMLC is a middle ground between SLC & MLC in terms of both cost and lifespan.
Opinions differ on the lifespan of the various NAND flash types, but in general these are what one could expect as an agreed approximation:
SLC: 100,000 write cycles, most expensive
MLC: 3,000 – 10,000 write cycles, least expensive
eMLC: 20,000 – 30,000 write cycles, somewhere in between
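As a rough, back-of-the-envelope illustration of what these cycle counts mean, the sketch below turns them into a total write budget for a drive. It assumes writes are spread perfectly evenly over every cell and that each host write costs exactly one flash write, which real drives do not achieve. The 200 GB capacity and 500 GB/day workload are made-up figures, and the MLC and eMLC entries use the low end of the ranges above.

```python
# Approximate write cycles per cell, taken from the list above (low end of ranges).
WRITE_CYCLES = {"SLC": 100_000, "MLC": 3_000, "eMLC": 20_000}


def write_budget_gb(capacity_gb, cycles):
    """Total data that can be written before the cells reach their cycle limit."""
    return capacity_gb * cycles


def lifetime_years(capacity_gb, cycles, host_writes_gb_per_day):
    """Years until the write budget is used up at a constant daily write rate."""
    return write_budget_gb(capacity_gb, cycles) / host_writes_gb_per_day / 365


if __name__ == "__main__":
    for flash_type, cycles in WRITE_CYCLES.items():
        years = lifetime_years(200, cycles, 500)  # hypothetical drive and workload
        print(f"{flash_type}: {years:.1f} years")
```

The absolute numbers mean little, but the ratios between the three types follow directly from the cycle counts.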
What other considerations are there with NAND flash? Well, since a cell has to be erased before it can be rewritten, wear levelling, write amplification and garbage collection are all terms you will hear from storage vendors as they try to address this characteristic of NAND flash.
Wear levelling ensures that all of the NAND flash cells are equally used. This avoids overloading a single cell, which could lead to that cell ‘wearing out’ and eventually to media errors on the SSD. If your SSD supports SMART, vSphere 5.1 ships with smartd, which allows you to monitor the wear level.
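A toy simulation shows why spreading writes matters. This is only a sketch under simplifying assumptions: the `simulate` function is hypothetical, it tracks erase counts per block rather than per cell, and the "always pick the least-erased block" policy is a stand-in for whatever a real flash controller does.

```python
def simulate(num_blocks, rewrites, wear_level=True):
    """Return per-block erase counts after `rewrites` erase/write cycles."""
    erase_counts = [0] * num_blocks
    for _ in range(rewrites):
        if wear_level:
            # Wear levelling: reuse the block that has been erased the least.
            block = min(range(num_blocks), key=lambda b: erase_counts[b])
        else:
            # Naive policy: always reuse the same physical block.
            block = 0
        erase_counts[block] += 1
    return erase_counts


print(simulate(4, 1000, wear_level=True))   # [250, 250, 250, 250]
print(simulate(4, 1000, wear_level=False))  # [1000, 0, 0, 0]
```

Without levelling, one block absorbs every erase and reaches its cycle limit four times sooner than it would if the same work were shared across all four blocks.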
Write amplification is where the amount of data that actually gets written to the flash in an SSD is a multiple of the data the host asked to write. This is the result of a NAND flash cell needing to be erased before it can be rewritten. To avoid this overhead, most flash algorithms will always choose new cells for new writes (writes usually traverse many cells). This means that the old data still exists in the original cells and will need to be cleaned up at a later date by a process called garbage collection. But until that happens, you can have multiple copies of stale data & one version of valid data on the SSD.
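The effect is usually expressed as a ratio of flash writes to host writes. The helper below is a minimal sketch of that arithmetic; the function name and the example figures are made up for illustration, and "relocated" here means valid data that the drive copies internally during clean-up.

```python
def write_amplification(host_kb, relocated_kb):
    """Ratio of data physically written to flash to data the host wrote."""
    if host_kb <= 0:
        raise ValueError("host_kb must be positive")
    return (host_kb + relocated_kb) / host_kb


# Hypothetical example: the host writes 256 KB, and the drive later has to
# copy 128 KB of still-valid data elsewhere before it can erase.
print(write_amplification(256, 128))  # 1.5
```

A ratio of 1.0 would mean every host write cost exactly one flash write; anything above that uses up write cycles faster than the host workload alone would suggest.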
Garbage collection and write amplification are closely related. Garbage collection is all about erasing blocks of NAND flash cells. To that end, the garbage collection algorithm may need to move cells containing valid data from one block to another block. Take an example where a write has stored data in half the cells in a block. An updated write for the same data comes in. This new data is written to a bunch of empty/erased cells in the same block, and of course it now makes the data in the first set of cells stale. So now you have a block half full of stale cells which need erasing. This is where garbage collection comes into play. It will move the valid cells to another block, and then erase all the cells in the original block so that it can be reused for future writes. As one can imagine, the algorithms used by the garbage collector play a significant role in extending the life-span of your SSD.
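The half-stale block example can be sketched in a few lines. This is a toy model, not any vendor's algorithm: the `Block` class and `garbage_collect` function are hypothetical, each list entry stands for one group of cells, and the 8-entry block size is arbitrary.

```python
FREE, VALID, STALE = "free", "valid", "stale"


class Block:
    """An erase block; each entry is the state of one group of cells."""

    def __init__(self, size):
        self.cells = [FREE] * size
        self.erase_count = 0

    def erase(self):
        # Flash can only be erased a whole block at a time.
        self.cells = [FREE] * len(self.cells)
        self.erase_count += 1


def garbage_collect(victim, spare):
    """Copy valid data from `victim` into `spare`, then erase `victim`.

    Returns the number of entries that had to be moved.
    """
    moved = 0
    for state in victim.cells:
        if state == VALID:
            slot = spare.cells.index(FREE)  # raises ValueError if spare is full
            spare.cells[slot] = VALID
            moved += 1
    victim.erase()
    return moved


# The example from the text: half the block is stale, half holds the update.
victim = Block(8)
victim.cells = [STALE] * 4 + [VALID] * 4
spare = Block(8)
moved = garbage_collect(victim, spare)
print(moved, victim.cells.count(FREE), victim.erase_count)  # 4 8 1
```

The 4 entries moved here are writes the host never asked for, which is how garbage collection feeds into write amplification. A collector that picks blocks with little valid data left has less to move.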
PCIe vs SSD
I guess the next question is where one should use flash. Should I use it in local PCIe cards or should I use it in a storage array? This is going to be one of those dreaded ‘it depends’ type of answers, I’m afraid. And it does depend on what you need. Many PCIe cards also appear as SCSI disks to a host, although there are some notable exceptions to that rule. If you want the lowest possible latency, then something like a PCIe flash card might be desirable, where performance features like Direct Memory Access can be used for I/O. With storage arrays, you may need multiple HBAs or initiators to get the same sort of throughput, and latency might be higher. However, PCIe cards are implemented on a per-host basis, so you might find that you are unable to leverage certain features like vMotion, DRS, etc. Now I know some PCIe flash vendors are doing a lot of work around this, so this might change too. And of course, the storage array approach allows the flash to be shared with multiple ESXi hosts over an FC or Ethernet interconnect, whereas each ESXi host would need its own PCIe flash card.
To date, VMware has done very little around SSD. We only have the smartd mentioned previously and the swap to SSD feature. Duncan does a great write-up about it here. Those of you who have been through the tech preview article on Distributed Storage will have read about SSD being used as a cache layer. Now we do have another project called vFlash, which aims to integrate vSphere with flash technologies in greater detail. I’ll follow up with a post about it on the vSphere blog very soon.
Get notification of these blog postings and more VMware Storage information by following me on Twitter: @CormacJHogan