A closer look at EMC ScaleIO

Thanks to our friends at EMC, I was recently given the chance to attend a session on EMC's new storage acquisition, ScaleIO. This acquisition generated a lot of interest (and perhaps some confusion), as VMware's Virtual SAN product seemed to play in the same storage space. My good friend Chad Sakac over at EMC wrote about this some 6 months ago in his evocatively titled blog post VSAN vs. ScaleIO fight! Chad explains where, in his opinion, each product can be positioned and how EMC/VMware customers have a choice of storage options. His article is definitely worth a read. I wanted to learn more about the ScaleIO product and share what I learned with you.

Introduction

The presentation was delivered by Boaz Palgi and Erez Webman, two of the original team who founded ScaleIO. ScaleIO's initial research found that many companies with multiple data centers distributed around the globe are now starting to consolidate into a few mega data centers. They also found that many small to midsized businesses (SMBs) no longer want to run their own data centers; they either want to outsource or run their business in hosted/cloud environments. ScaleIO spotted an opportunity to build a converged, scale-out solution that runs both virtual and bare-metal workloads on the converged storage and compute tier, and scales to hundreds or thousands of nodes to meet these new customers' requirements.

Architecture

There are two main components in ScaleIO, both of which run alongside the applications, hypervisors, and databases on standard servers. The first is the ScaleIO Data Client (SDC), which is installed on a server node. Applications on the server talk to the local SDC, which intercepts the application's I/O requests. The second is the ScaleIO Data Server (SDS), which is installed on each server node that contributes local storage to the cluster. The SDS performs the actual disk I/O and is responsible for making application data highly available across the cluster.

Once this configuration is in place, data volumes can be created. These volumes may be distributed across all the nodes participating in the ScaleIO cluster. Volumes are split up into 'chunks', and it is these chunks that are distributed across the local storage of each of the nodes/hosts participating in the cluster. The chunks are also mirrored across the disks in the cluster in a balanced fashion, making the volume both highly available and capable of extremely high performance.
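
To make the chunking and mirroring idea a little more concrete, here is a minimal Python sketch of that style of placement. The node names, the 1 MB chunk size and the simple round-robin placement are my own illustrative assumptions, not ScaleIO's actual algorithm.

```python
# Illustrative sketch only: the node names, the 1 MB chunk size and the
# round-robin placement are assumptions, not ScaleIO's real algorithm.
CHUNK_SIZE_MB = 1

def place_volume(volume_size_mb, nodes):
    """Split a volume into fixed-size chunks and place two copies of each
    chunk on two different SDS nodes, spreading them across the cluster."""
    assert len(nodes) >= 2, "mirroring needs at least two SDS nodes"
    num_chunks = volume_size_mb // CHUNK_SIZE_MB
    placement = {}
    for chunk_id in range(num_chunks):
        primary = nodes[chunk_id % len(nodes)]        # balanced spread
        mirror = nodes[(chunk_id + 1) % len(nodes)]   # second copy on another node
        placement[chunk_id] = (primary, mirror)
    return placement

layout = place_volume(volume_size_mb=8, nodes=["sds-01", "sds-02", "sds-03"])
for chunk_id, (primary, mirror) in sorted(layout.items()):
    print(f"chunk {chunk_id}: primary={primary} mirror={mirror}")
```

Because every chunk has a copy on a second node, reads and writes for a single volume fan out across the whole cluster rather than hitting one disk or one node.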

ScaleIO also has the concept of a Protection Domain. A protection domain is simply a group of SDSs, which allows a multi-node cluster to be split up into failure domains. When a volume is created, it can be guard-railed by a protection domain, consuming resources only from nodes in that domain. All nodes in the cluster can still access the volume, but the volume resides in that protection domain only. Nodes may also be moved in and out of protection domains as the need arises, providing elasticity in the cluster.
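
Conceptually, a protection domain just restricts where a volume's chunks may land. The short sketch below models that restriction; the domain names, node names and placement logic are again purely illustrative assumptions, not ScaleIO internals.

```python
# Illustrative model of protection domains: domain/node names and the
# placement logic are assumptions for this sketch, not ScaleIO internals.
protection_domains = {
    "pd-web": ["sds-01", "sds-02", "sds-03"],
    "pd-db":  ["sds-04", "sds-05", "sds-06"],
}

def create_volume(name, num_chunks, domain):
    """Place a volume's mirrored chunks only on SDS nodes in one domain."""
    nodes = protection_domains[domain]
    chunks = {
        chunk_id: (nodes[chunk_id % len(nodes)],
                   nodes[(chunk_id + 1) % len(nodes)])
        for chunk_id in range(num_chunks)
    }
    return {"name": name, "domain": domain, "chunks": chunks}

# The volume consumes capacity only from pd-db, although any node in the
# cluster could still access it over the network.
vol = create_volume("accounts-db", num_chunks=6, domain="pd-db")
print(vol)
```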

One interesting point is that there is no requirement to have uniformly configured nodes in the cluster. It seems any configuration of nodes can be supported (CPU, memory, network, size of disk, number of disks, speed of disk), and these will all work together quite happily in the cluster.

As mentioned, ScaleIO nodes use local storage. However, Erez stated that nodes only need access to block storage, so ScaleIO could feasibly sit in front of SAN storage if required. That said, Erez didn't see any use cases for such a configuration.

From a component failure perspective, ScaleIO uses a mesh mirroring mechanism, which means that all nodes can be involved in the rebuild process in the event of a failure. In a 100-node cluster, all 100 nodes can take part in the rebuild, making it very swift.
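
A quick back-of-the-envelope calculation shows why this matters. The per-node rebuild bandwidth and the amount of failed capacity below are purely illustrative assumptions; the scaling effect is the point.

```python
# Back-of-the-envelope rebuild estimate. The 100 MB/s per-node rebuild rate
# and the 2 TB of failed capacity are illustrative assumptions only.
def rebuild_time_minutes(failed_capacity_gb, surviving_nodes,
                         per_node_rebuild_mb_s=100):
    """With mesh mirroring, every surviving node rebuilds a slice of the
    failed node's data in parallel, so rebuild time shrinks as the
    cluster grows."""
    total_mb = failed_capacity_gb * 1024
    aggregate_mb_s = surviving_nodes * per_node_rebuild_mb_s
    return total_mb / aggregate_mb_s / 60

for nodes in (4, 10, 100):
    print(f"{nodes} surviving nodes: "
          f"{rebuild_time_minutes(2000, nodes):.1f} minutes")
```

With these made-up numbers, the same 2 TB rebuild drops from roughly 85 minutes on 4 nodes to a few minutes on 100 nodes, simply because every node contributes a slice of the work.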

Data Services

ScaleIO already comes with a set of enterprise-class data services. These include snapshots that are fully writable and that support 'consistency groups'. Through the use of protection domains and storage pools, ScaleIO can offer multi-tenancy with segregated data. It also offers lightweight encryption of data at rest. There is QoS on a per-volume basis as well, which means you can limit the number of IOPS and/or the bandwidth per application. The storage layer offers multi-tiering using a combination of PCIe flash, SSD and HDD, as well as auto-balancing of components across the storage as nodes/storage are added to or removed from the cluster.
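
To illustrate the sort of per-volume cap the QoS feature describes, here is a minimal token-bucket-style IOPS limiter. It is only a sketch of the concept, not ScaleIO's QoS implementation, and the 500 IOPS figure is an arbitrary example.

```python
import time

class VolumeIopsLimit:
    """Minimal token-bucket style limiter illustrating a per-volume IOPS cap.
    This is a sketch of the concept only, not ScaleIO's QoS implementation."""

    def __init__(self, iops_limit):
        self.iops_limit = iops_limit
        self.tokens = float(iops_limit)
        self.last_refill = time.monotonic()

    def allow_io(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at one
        # second's worth of I/Os.
        self.tokens = min(self.iops_limit,
                          self.tokens + (now - self.last_refill) * self.iops_limit)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True      # I/O may be issued now
        return False         # caller should queue or retry the I/O

limiter = VolumeIopsLimit(iops_limit=500)  # e.g. cap a noisy volume at 500 IOPS
print(limiter.allow_io())
```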

Another interesting feature is ScaleIO's ability to deliver auto-tiering by running third-party flash cache software, EMC's XtremCache for example.

Something which is currently missing (but which is on the road-map) is a Disaster Recovery (DR) solution. Admittedly, in a vSphere environment, vSphere Replication with SRM orchestration is an option that can be offered to customers. However, Erez stated that the ScaleIO team come from a DR background and that a working DR solution with a low Recovery Point Objective (RPO) and async/sync replication options is high on their list of priorities.

VMware Interoperability

ScaleIO is positioning itself for many different use cases; virtualization is only one of them. So what is needed to enable ScaleIO in a vSphere environment? Well, there is no integration directly into the VMkernel like VMware's Virtual SAN (VSAN), although ScaleIO does have that kind of integration with other hypervisors. For ESXi, ScaleIO delivers a ready-to-go virtual machine that includes all relevant ScaleIO components. A single ScaleIO VM is required per ESXi host for ScaleIO to work in a vSphere environment.

Volumes, when created, are presented back to the ESXi hosts as iSCSI LUNs. These may then be used as Raw Device Mappings (RDMs) or formatted as VMFS datastores, and virtual machine deployment can then begin in earnest. The ScaleIO VM runs openSUSE from what I can see, and needs around 1GB of memory and 1-2 vCPUs. You will need a minimum of three physical ESXi hosts to run ScaleIO, but obviously the idea is that you can scale up to hundreds or thousands of ESXi hosts. Let's look at the core vSphere features and how they inter-operate in more detail.
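
Before we do, here is a rough planning sketch that simply restates the figures above (one ScaleIO VM per host at roughly 1GB of memory and up to 2 vCPUs) and assumes two copies of every chunk when estimating usable capacity. All the numbers are illustrative.

```python
# Rough planning arithmetic based on the figures above: one ScaleIO VM per
# ESXi host at ~1 GB RAM and up to 2 vCPUs, and two copies of every chunk.
# All numbers are illustrative assumptions.
def cluster_estimate(hosts, raw_tb_per_host):
    assert hosts >= 3, "ScaleIO on vSphere needs at least three ESXi hosts"
    return {
        "scaleio_vm_ram_gb": hosts * 1,
        "scaleio_vm_vcpus_max": hosts * 2,
        "raw_capacity_tb": hosts * raw_tb_per_host,
        # Mirroring keeps two copies of each chunk, so usable ~= raw / 2.
        "usable_capacity_tb": hosts * raw_tb_per_host / 2,
    }

print(cluster_estimate(hosts=8, raw_tb_per_host=4))
```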

  • ScaleIO fully supports vSphere HA. As long as the ScaleIO iSCSI volume remains available, the VM files remain available, so there is no reason to be concerned about the fact that the underlying data is distributed.
  • ScaleIO supports vMotion. VMs can be migrated between hosts/nodes.
  • ScaleIO supports DRS; automated vMotions to balance VM CPU and memory load across all ESXi hosts also work in a ScaleIO cluster.
  • One can migrate a VM deployed on ScaleIO to another datastore and vice-versa via Storage vMotion.
  • Storage DRS has the ability to balance VMs across multiple ScaleIO volumes in a datastore cluster. One question here is how Storage DRS balancing on I/O metrics interacts with ScaleIO QoS set on the datastores. My take is that ScaleIO can set a QoS limit based on IOPS, while Storage DRS can balance based on latency (although SDRS I/O-metric balancing may never kick in if the cluster is well balanced and the workloads in the VMs are well behaved).
  • Storage I/O Control (SIOC), which is designed to address the noisy-neighbour issue, has the same considerations as SDRS I/O-metric balancing. In a well-balanced ScaleIO cluster using QoS, it may prove unnecessary, unless some workloads begin to misbehave.
  • DPM (Distributed Power Management) should be left off. DPM has the ability to power off ESXi hosts when they have no running VMs. Of course, in both VSAN and ScaleIO clusters this would be a serious problem, as data may be hosted on an ESXi host even when no VMs are running on it. So don't use this feature.
  • Fault Tolerance can be configured and works on ScaleIO. I'm not sure what the support statement is, however.

Conclusion

Once again, we see a very nice converged, scale-out solution on the market. This product is already GA, with version R1.20 available to customers. My understanding is that the next release, R1.30, will have a number of new features, one of which is tighter integration with EMC's host-based flash products.

I hope this has given you some food for thought. Certainly there seems to be quite a change underway from scale-up storage arrays towards scale-out compute and storage. And as Chad says, customers have a choice. If you are looking for a converged, scale-out solution for a vSphere environment, another hypervisor, and/or physical servers, it is certainly worth your while taking a closer look at ScaleIO.

Special thanks to Chad Sakac and Matt Cowger of EMC for fielding many of the interop questions I had after the presentation.

9 Replies to “A closer look at EMC ScaleIO”

  1. So when you provision a new iSCSI LUN, is it presented from all heads in the cluster?

    Do all participating storage server nodes need to be added as targets?

    How are the paths shown?

    Is there any clever 'placement-aware' algorithm to direct a storage consumer's I/O path to its local gateway and avoid a possible double latency hit?

    1. 1) A new LUN is only accessible from the SDCs authorized as consumers of that volume. Often that would be 4-8 (or more) SDCs if you were to then export it over iSCSI to an ESXi farm.

      2) Yes, each SDC needs to be added as a target.

      3) Just as regular paths in ESXi – no different than any other array.

      4) We recommend fixed pathing, with the host's local SDC as the primary path, for the lowest latency.

  2. Sounds really similar to Coraid or GridStore, in the way they all dropped common communication protocols (iSCSI or NFS) in favor of a custom driver installed directly into the server that needs to access resources from a backend storage cluster. Do they also have Linux and Windows drivers? In GridStore, for example, it is a miniport driver for Windows…

    1. This is a little different, in that the backend cluster and the 'client' systems are very often the same systems. Unlike Coraid, there is no reaching out to a dedicated JBOD over Layer 2 protocols…just another system via TCP/IP. It even works fine over routed networks.

      There are both Windows and Linux drivers currently.

  3. I am a little confused here, as you mentioned there are both Windows and Linux drivers.

    Do the SDC and SDS both run in a single openSUSE-based VM? If so, is the I/O path something like: application -> iSCSI initiator -> iSCSI target -> SDC -> SDS?

    or

    Does the SDC run within the application server VM itself, in which case the I/O path would look like: application -> SDC -> iSCSI initiator -> iSCSI target -> SDS?

    1. The point is that ScaleIO is not reliant on vSphere – it can run on different hypervisors and on physical servers – which is why there are different drivers.

      To run it on vSphere, a Linux VM is deployed with the appropriate drivers. Once a volume is created, it is presented over iSCSI.

      Therefore the I/O path goes from the application running in the VM to the iSCSI volume presented by ScaleIO, and then ScaleIO does its mirroring/distribution to local storage.

      Hope that makes sense.

      1. Thanks for clarifying! It does clear things up a bit. So, in the case of a Windows Hyper-V environment, is there a "native" Windows miniport driver (similar to GridStore), or is the same Linux VM deployed there as well?
