I caught up with my good friends Wen Yu and Devin Hamilton at VMworld. Both have recently joined the Datrium team. As well as getting an overview of their products and features, and what problems they were trying to solve for customers with their solution, we also spoke about the general vision of Datrium. I was also privileged to talk with Boris and Brian recently. Hopefully this article will give readers a good understanding of Datrium.
In a recent interview, one of the founders (Boris) shared the thought process behind the creation of Datrium. He was working on some core vSphere features at VMware (he worked on many) and encountered a number of storage-related bottlenecks during his development work. Some of these bottlenecks were performance-related, and his team had to engage with the storage company to try to troubleshoot and resolve these issues. He was also looking to test these core features at scale, and that was when the cost bottleneck arose; his team simply couldn’t afford to purchase more storage hardware for this testing. These are the core tenets behind Datrium: addressing these two major issues of cost and performance.
In discussions with Wen and Devin, it would appear that Datrium didn’t want to go with a complete hyper-converged system. These systems, which for the most part scale linearly, sometimes result in lots of spare storage capacity as the compute part is scaled out, or similarly, lots of spare and redundant compute capacity as the storage capacity is scaled out. These HCI (hyper-converged infrastructure) systems are intrinsically tied together, with stateful hosts storing durable data. Datrium say that a frequent complaint they hear is that it’s difficult to isolate performance issues at the VM level because I/Os have multi-host dependencies (VMs are host-specific but HCI systems share resources across hosts, so they are chattier, and troubleshooting and tuning are more complex). Datrium wanted to come up with what they believe to be a better scaling solution.
The DVX architecture is at the core of the Datrium design model. It is composed of two distinct components: the hardware appliance (NetShelf) and the DiESL (pronounced diesel) software. The sole purpose of the hardware appliance is to provide cheap, durable storage that is easy to manage. There are no data services running on the NetShelf appliances (RAID, cloning, compression, deduplication). It is basically a hardware appliance for durable data, and is fully redundant with no single points of failure (SPOF). Connectivity between the NetShelf and ESXi hosts is via a 10Gb link. The DiESL data services run on the ESXi hosts, but it should be noted that the ESXi hosts are completely stateless (all VM writes go to NVRAM on the NetShelf), so there is no issue with running core vSphere features on the ESXi hosts, such as DRS and vSphere HA. Additionally, by having stateless hosts with write isolation, it’s easier to troubleshoot, tune and service. At this point in time, Datrium supports ESXi hosts connecting to a single NetShelf. The plan is to support up to 32 hosts per NetShelf initially, but my understanding is that the desire is to keep this number in sync with the DRS cluster size going forward.
Another hardware component is the inclusion of flash devices on the ESXi hosts, which provide read cache acceleration. This is a write-through cache, used only for accelerating reads. Writes go directly to the NetShelf, and are only acknowledged when they are committed on the mirrored NVRAM. Note that a flash cache is a requirement for the Datrium solution. Datrium feels that this could be an excellent solution for hosts that do not have a lot of local drive bays, such as blade servers.
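To make the write-through behaviour concrete, here is a minimal Python sketch. It is not Datrium code: the class names `DurableStore` and `WriteThroughHost` are hypothetical, the NetShelf is modelled as a plain dictionary, and populating the local flash on a write is my assumption rather than something Datrium confirmed. The point it illustrates is that a write is acknowledged only after the durable store accepts it, so losing the host's flash loses no data.

```python
class DurableStore:
    """Hypothetical stand-in for the NetShelf: the only home of durable data."""

    def __init__(self):
        self.blocks = {}

    def write(self, block_id, data):
        # In the real system this would be a commit to mirrored NVRAM.
        self.blocks[block_id] = data
        return True  # acknowledgement

    def read(self, block_id):
        return self.blocks[block_id]


class WriteThroughHost:
    """Hypothetical ESXi host with a local flash read cache."""

    def __init__(self, store):
        self.store = store
        self.flash = {}
        self.hits = 0
        self.misses = 0

    def write(self, block_id, data):
        # Write-through: the durable store is written first and must acknowledge.
        acked = self.store.write(block_id, data)
        if acked:
            # Assumption: the freshly written block is also kept in local flash.
            self.flash[block_id] = data
        return acked

    def read(self, block_id):
        if block_id in self.flash:
            self.hits += 1
            return self.flash[block_id]
        # Cache miss: fetch from the durable store and warm the cache.
        self.misses += 1
        data = self.store.read(block_id)
        self.flash[block_id] = data
        return data

    def lose_flash(self):
        # Simulates a flash device failure; no durable data lives here.
        self.flash.clear()
```

Calling `lose_flash()` and then reading a previously written block still returns the data, at the cost of one cache miss.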
The DiESL software uses a slice of hypervisor resources to deliver the acceleration and other data services. It consumes two CPU cores, one for co-ordination and one for speed. It also requires 7.5GB RAM for a flash size/cache acceleration size of 1TB. An additional 1GB RAM is required for every additional 1TB of flash/cache on top of this.
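The memory requirement above is simple arithmetic, and a short Python sketch may help when sizing hosts. The function name `diesl_ram_gb` is hypothetical, and rounding a partial terabyte up to the next whole terabyte is my assumption; the figures themselves (7.5GB for the first 1TB, 1GB for each additional 1TB) are the ones quoted above.

```python
import math

BASE_RAM_GB = 7.5        # RAM quoted for the first 1TB of flash cache
EXTRA_RAM_PER_TB = 1.0   # RAM quoted for each additional 1TB of flash cache


def diesl_ram_gb(flash_tb: float) -> float:
    """Estimate host RAM (in GB) consumed for a given flash cache size in TB."""
    if flash_tb <= 0:
        raise ValueError("a flash cache is required, so flash_tb must be positive")
    # Assumption: a partial TB is charged as a full additional TB.
    extra_tb = max(0, math.ceil(flash_tb) - 1)
    return BASE_RAM_GB + EXTRA_RAM_PER_TB * extra_tb


if __name__ == "__main__":
    for size in (1, 2, 4, 8):
        print(f"{size}TB flash -> {diesl_ram_gb(size)}GB RAM")
```

Under these assumptions, a host with 8TB of flash would need 7.5GB + 7 x 1GB = 14.5GB of RAM.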
DiESL (Distributed Execution Shared Log) is the name of the Datrium filesystem. Part of DiESL resides on the NetShelf, orchestrating I/O. The filesystem is designed to be very sequential in nature. For this reason, if there is a cache miss, then fetching the block from the filesystem is extremely fast. The majority of the DiESL component resides on the ESXi host. At the moment, the only supported hypervisor is VMware ESXi. Datrium installs the components as a VIB, and this allows them to do all of the data services one would expect from a storage solution, such as performance acceleration, compression, deduplication, RAID levels (and rebuilds), as well as the orchestration of snapshots. In essence, it is a filesystem.
The DVX architecture presents the combined local cache/flash devices and NetShelf as a single shared NFS v3 datastore to your ESXi hosts.
Commodity flash/bring your own SSDs
This is possibly the most interesting angle I find with Datrium. They are allowing their customers to purchase their own commodity flash for the hypervisor hosts. This is not sold by Datrium. Customers can pick it up from anywhere. So now, not only do you have cache for acceleration close to the hypervisor, but it can also be really cheap. A concern might be raised about the use of consumer-grade flash. However, Datrium say that they have designed the system from the ground up to use consumer-grade flash, and that there are integrity checks and checksums used throughout to ensure that there is no data corruption. Datrium support up to 8TB of physical flash capacity per host. When coupled with inline dedupe and compression, Datrium claim that the cost of local flash acceleration for their systems is minimal.
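As an illustration of how checksums can guard a cache built on consumer-grade flash, here is a minimal Python sketch. It is not Datrium's implementation: the class `ChecksummedCache` is hypothetical, and the use of SHA-256 is my assumption, since Datrium did not say which checksum they use. The idea is that a block failing verification is treated as a cache miss, so the caller falls back to the durable copy instead of returning corrupt data.

```python
import hashlib


class ChecksummedCache:
    """Hypothetical flash cache that verifies a checksum on every read."""

    def __init__(self):
        self._entries = {}  # block_id -> (data, hex digest)

    def put(self, block_id, data: bytes):
        # Store the digest alongside the data when the block is cached.
        self._entries[block_id] = (data, hashlib.sha256(data).hexdigest())

    def get(self, block_id):
        entry = self._entries.get(block_id)
        if entry is None:
            return None  # ordinary cache miss
        data, digest = entry
        if hashlib.sha256(data).hexdigest() != digest:
            # Corruption detected: evict the block and report a miss.
            del self._entries[block_id]
            return None
        return data

    def corrupt(self, block_id):
        # Helper for demonstration only: flips one bit of a non-empty block.
        data, digest = self._entries[block_id]
        self._entries[block_id] = (bytes([data[0] ^ 0x01]) + data[1:], digest)
```

A caller that receives `None` would simply re-read the block from the NetShelf, which is safe because the cache never holds the only copy of the data.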
The obvious next question is about the purpose of the cache. Is it a read cache (write-through), or a read and write (write-back) cache? It is a read cache at the moment, so if there is a failure, the data is still persisted on the NetShelf. There is no need to worry about mirroring the cache then. This is what Datrium mean when they say that the hosts are stateless.
So what happens when there is a vMotion of a VM between hosts? Since the cache is local, won’t there be a performance dip while the cache warms on the new host? This is where things start getting interesting. Datrium’s technology will reach back to the cache on the original source host for a block, whilst the cache on the new destination host is still warming. So vMotion operations have been optimized in the DVX architecture.
A similar question arises around maintenance mode. If a host is placed in maintenance mode and the VMs are moved to the remaining hosts in the cluster, then so long as the source server is still reachable, the VMs can still reach back to the original source flash for data blocks.
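The read path described for vMotion and maintenance mode can be sketched in a few lines of Python. This is only an illustration of the lookup order as I understood it, not Datrium's code: the function `read_block` is hypothetical, the caches and the NetShelf are modelled as dictionaries, and passing `None` for the source cache stands in for a source host that is no longer reachable.

```python
def read_block(block_id, local_cache, source_cache, netshelf):
    """Return (data, origin) for a block read by a recently migrated VM.

    Lookup order assumed here: destination host flash, then the original
    source host's flash (if reachable), then the NetShelf.
    """
    if block_id in local_cache:
        return local_cache[block_id], "local"

    if source_cache is not None and block_id in source_cache:
        data, origin = source_cache[block_id], "source-host"
    else:
        # Source host unreachable or block not cached there: go to durable storage.
        data, origin = netshelf[block_id], "netshelf"

    # Warm the destination host's cache so the next read is local.
    local_cache[block_id] = data
    return data, origin


if __name__ == "__main__":
    netshelf = {1: b"one", 2: b"two"}
    source = {1: b"one"}   # cache on the host the VM migrated from
    destination = {}       # cold cache on the new host
    print(read_block(1, destination, source, netshelf))  # served by the source host
    print(read_block(1, destination, source, netshelf))  # now served locally
    print(read_block(2, destination, None, netshelf))    # source gone, NetShelf used
```

Each remote read also warms the destination cache, which is why the performance dip after a migration should be short-lived.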
Another interesting point that comes out of this is support. Datrium provide full support for their own back-end NetShelf appliances, as you might imagine. However, servers and storage controllers used for the hypervisor must be on the VMware HCL. As for SSD support, in addition to the ones listed in the VMware HCL, there will be a compatibility guide from Datrium listing validated consumer-grade flash devices.
Everything is managed from a vCenter plugin via the vSphere client. The UI that Datrium developed is very VM-centric, with the VM being a first-class object. As you can see here, the UI displays per-VM statistics:
Datrium claim that the vision for their DVX architecture is to allow you to scale performance independently of capacity. If you need more compute, just add another server, and they will take care of pushing out the Datrium software (VIB) to that host, and mounting the NFS datastore to it. If you need more performance, simply add more SSDs to a given host, or add another host with flash cache to the ESXi cluster. If a given host is running low on headroom, simply vMotion/migrate VMs to another host in the cluster where there are additional resources, whilst maintaining the current level of performance.