Introduction
The presentation I received from ScaleIO was delivered by Boaz Palgi and Erez Webman, two of the original team who founded the company. ScaleIO's initial research found that many companies which had multiple data centers distributed around the globe are now starting to consolidate into a few mega data centers. They also found that many small to mid-sized businesses (SMBs) no longer want to run their own data centers; they would rather outsource, or run their business in hosted/cloud environments. ScaleIO spotted an opportunity to build a converged, scale-out solution that would run both virtual and bare metal workloads on the converged storage & compute tier, and scale to hundreds or thousands of nodes to meet these new customers' requirements.
Architecture
There are two main components in ScaleIO, both of which run alongside the applications, hypervisors, and databases on standard servers. The first of these is the ScaleIO Data Client or SDC. This is installed on a server node; applications on the server talk to the local SDC, which intercepts their I/O requests. The second component is the ScaleIO Data Server or SDS. This is installed on each server node that contributes local storage to the cluster. The SDS handles disk I/O and is responsible for making application data highly available across the cluster.
Once this configuration is in place, data volumes can be created. These volumes may be distributed across all the nodes participating in the ScaleIO cluster. Volumes are split up into 'chunks', and it is these chunks that are distributed around the local storage of each of the nodes/hosts in the cluster. The chunks are also mirrored across the disks in the cluster in a balanced fashion, making a volume both highly available and extremely high performing.
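To picture how this distribution works, here is a minimal Python sketch of the idea: a volume is carved into fixed-size chunks and each chunk gets two copies placed on different nodes, spread evenly across the cluster. The chunk size, node names and round-robin placement policy below are purely illustrative assumptions, not ScaleIO's actual internals.

```python
import itertools

CHUNK_SIZE_MB = 1  # illustrative only; not ScaleIO's real chunk size

def place_volume(volume_size_mb, nodes, copies=2):
    """Spread mirrored chunks of a volume evenly across cluster nodes.

    Returns a mapping of chunk index -> list of nodes holding a copy.
    """
    num_chunks = -(-volume_size_mb // CHUNK_SIZE_MB)  # ceiling division
    node_cycle = itertools.cycle(nodes)
    layout = {}
    for chunk in range(num_chunks):
        placement = []
        while len(placement) < copies:
            node = next(node_cycle)
            if node not in placement:       # copies must land on different nodes
                placement.append(node)
        layout[chunk] = placement
    return layout

# An 8MB "volume" spread over a 4-node cluster: every node ends up
# holding roughly the same number of chunk copies.
print(place_volume(8, ["node1", "node2", "node3", "node4"]))
```

The point of the sketch is simply that no single disk or node holds a whole volume, so reads and writes fan out across the cluster.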
ScaleIO also has the concept of a Protection Domain. A protection domain is simply a group of SDSs, which allows a multi-node cluster to be split up into failure domains. When a volume is created, it can be guard-railed by a protection domain and consume resources only from nodes in that domain. All nodes in the cluster can still access the volume, but the volume's data resides in that protection domain only. Nodes may also be moved in and out of protection domains as the need arises, providing elasticity in the cluster.
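As a rough illustration, and reusing the place_volume sketch above, constraining a volume to a protection domain simply means the placement logic only ever sees the SDS nodes in that domain. The domain and node names here are made up for the example.

```python
# Hypothetical protection domain layout; names are made up for illustration.
protection_domains = {
    "pd-finance": ["node1", "node2", "node3"],
    "pd-test":    ["node4", "node5", "node6"],
}

def place_volume_in_domain(volume_size_mb, domain, domains):
    """Place a volume using only the SDS nodes of one protection domain."""
    return place_volume(volume_size_mb, domains[domain])

# The volume's chunks live only on pd-finance nodes, even though any
# node in the cluster may still access the volume via its local SDC.
layout = place_volume_in_domain(8, "pd-finance", protection_domains)
```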
One interesting point is that there is no requirement for uniformly configured nodes in the cluster. It seems any mix of node configurations can be supported (CPU, memory, network, disk size, number of disks, disk speed), and these will all work together quite happily in the cluster.
As mentioned, ScaleIO nodes use local storage. However, Erez stated that nodes only need access to block storage, so ScaleIO could feasibly sit in front of SAN storage if required. That said, Erez didn't see any use cases for such a configuration.
From a component failure perspective, ScaleIO uses a mesh mirroring mechanism, which means that all nodes can be involved in the rebuild process in the event of a failure. In a 100-node cluster, for example, all 100 nodes can participate in the rebuild, making it very swift.
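A back-of-the-envelope sketch of why mesh mirroring makes rebuilds fast: the re-mirroring work for a failed node is shared by all the surviving nodes rather than funnelled through a single spare. The throughput figure below is an assumption picked purely to show the scaling effect, not a ScaleIO specification.

```python
def rebuild_time_hours(failed_capacity_gb, surviving_nodes,
                       per_node_rebuild_gb_per_hour=200):
    """Estimate rebuild time when every surviving node shares the work.

    Purely illustrative arithmetic; the per-node throughput is an assumption.
    """
    aggregate_rate = surviving_nodes * per_node_rebuild_gb_per_hour
    return failed_capacity_gb / aggregate_rate

# Rebuilding 2TB worth of lost copies:
print(rebuild_time_hours(2000, surviving_nodes=3))    # ~3.3 hours on a small cluster
print(rebuild_time_hours(2000, surviving_nodes=99))   # ~0.1 hours on a large cluster
```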
Data Services
ScaleIO already comes with a set of enterprise-class data services. These include snapshots that are fully writable and support 'consistency groups'. Through the use of protection domains and storage pools, ScaleIO can offer multi-tenancy with segregated data. They also offer lightweight encryption of data at rest. There is also QoS on a per-volume basis, which means you can limit the number of IOPS and/or bandwidth per application. Their storage layer offers multi-tiering using a combination of PCIe flash, SSD & HDD, as well as auto-balancing of components across the storage as nodes/storage are added to or removed from the cluster.
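To make the per-volume QoS idea concrete, here is a minimal token-bucket style rate limiter sketch in Python. It is a generic illustration of capping IOPS for one volume, not ScaleIO's actual implementation or API.

```python
import time

class VolumeIopsLimiter:
    """Generic token-bucket limiter capping IOPS for a single volume (illustrative)."""

    def __init__(self, max_iops):
        self.max_iops = max_iops
        self.tokens = float(max_iops)
        self.last_refill = time.monotonic()

    def allow_io(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at max_iops.
        self.tokens = min(self.max_iops,
                          self.tokens + (now - self.last_refill) * self.max_iops)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True     # the I/O may be issued now
        return False        # caller should queue or retry the I/O

limiter = VolumeIopsLimiter(max_iops=500)   # e.g. cap a noisy application at 500 IOPS
if limiter.allow_io():
    pass  # issue the I/O to the volume
```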
Another interesting feature is ScaleIO's ability to deliver auto-tiering by running a third-party flash cache software product alongside it, EMC's XtremCache for example.
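As a simple way to picture what a host-side flash cache layer does, here is a tiny LRU read-cache sketch: hot blocks are served from a small, fast tier while everything else falls through to the distributed volume. This is a generic caching illustration, not how XtremCache itself is implemented, and the block-read function is a stand-in.

```python
from collections import OrderedDict

class FlashReadCache:
    """Tiny LRU read cache standing in for a host-side flash tier (illustrative)."""

    def __init__(self, capacity_blocks, backend_read):
        self.capacity = capacity_blocks
        self.backend_read = backend_read      # function that reads a block from the volume
        self.cache = OrderedDict()

    def read(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)  # cache hit: served from the fast tier
            return self.cache[block_id]
        data = self.backend_read(block_id)    # cache miss: go to the distributed volume
        self.cache[block_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict the least recently used block
        return data

# Usage: wrap whatever function reads a block from the underlying volume.
cache = FlashReadCache(capacity_blocks=1024, backend_read=lambda b: f"data-{b}")
cache.read(42)   # miss: fetched from the volume, then cached
cache.read(42)   # hit: served from the "flash" tier
```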
Something which is currently missing (but which is on the road-map) is a Disaster Recovery (DR) solution. Admittedly, in a vSphere environment, vSphere Replication with SRM orchestration is an option that can be offered to customers. However, Erez stated that the ScaleIO team comes from a DR background and that a working DR solution with a low Recovery Point Objective (RPO) and async/sync replication options is high on their list of priorities.
VMware Interoperability
Volumes, when created, are presented back to the ESXi hosts as iSCSI LUNs. These may then be used as Raw Device Mappings (RDMs) or formatted as VMFS datastores, and Virtual Machine deployment can then begin in earnest. The ScaleIO VM runs openSUSE from what I can see, and needs around 1GB of memory and 1-2 vCPUs. You will need a minimum of three physical ESXi hypervisor servers to run ScaleIO, but obviously the idea is that you can scale out to hundreds or thousands of ESXi hypervisor servers. Let's look at the core vSphere features and how they interoperate in more detail.
- ScaleIO fully supports vSphere HA. As long as the ScaleIO iSCSI volume remains available, the VM files are still available, so there is no reason to be concerned about the fact that the underlying data is distributed.
- ScaleIO supports vMotion. VMs can be migrated between hosts/nodes.
- ScaleIO supports DRS, which uses automated vMotions to balance VM load (CPU/memory) across all ESXi hosts in a ScaleIO cluster.
- One can migrate a VM deployed on ScaleIO to another datastore and vice-versa via Storage vMotion.
- Storage DRS has the ability to balance VMs across multiple ScaleIO volumes in a datastore cluster. I suppose one question here relates to Storage DRS balancing on I/O metrics while ScaleIO QoS is set on the datastores. My take is that ScaleIO can set a QoS based on IOPS, while Storage DRS balances based on latency (although SDRS I/O metric balancing may never kick in if the cluster is well balanced and the workloads in the VMs are well behaved).
- Storage I/O Control (SIOC), which is designed to address the noisy neighbour issue, has the same considerations as SDRS I/O metric balancing. In a well-balanced ScaleIO cluster using QoS, it may prove unnecessary, unless some workloads begin to misbehave.
- DPM – Distributed Power Management – should be left off. DPM has the ability to power off ESXi hosts when there are no running VMs on them. Of course, in both VSAN and ScaleIO clusters this would be a serious problem, as data may be hosted on an ESXi host that has no running VMs. So don't use this feature.
- Fault Tolerance can be configured and works on ScaleIO. I'm not sure what the support statement is, however.
Conclusion
Once again, we see a very nice converged, scale-out solution on the market. This product is already GA, with version R1.20 available to customers. My understanding is that the next release, R1.30, will have a number of new features, one of which is tighter integration with EMC's host-based flash products.
I hope this has given you some food for thought. Certainly there seems to be quite a change underway from scale-up storage arrays towards scale-out compute and storage. And as Chad says, customers have a choice. If you are looking for a converged, scale-out solution for a vSphere environment, another hypervisor, physical servers, or any combination of these, it is certainly worth your while taking a closer look at ScaleIO.
Special thanks to Chad Sakac and Matt Cowger of EMC for fielding many of the questions I had about interop after the presentation.