I was having some discussions recently on the community forums about Virtual SAN behaviour when a VM storage policy is changed on-the-fly. This is a really nice feature of Virtual SAN whereby requirements related to availability and performance can be changed dynamically without impacting the running virtual machine. I wrote about it in the blog post here. However, there are some important considerations to take into account when changing a policy on the fly like this.
I was involved in an interesting case recently. It was interesting because the customer was running an 8-node cluster, with 4 disk groups per host and 5 x ~900GB hard disks per disk group, which should have provided somewhere in the region of 150TB of storage capacity (with a little overhead for metadata). But after some maintenance tasks, the customer was seeing only approximately 100TB on the VSAN datastore.
This was a little strange since the VSAN status in the vSphere web client was showing all 160 disks claimed by VSAN, yet the capacity of the VSAN datastore did not reflect this. So what could cause this behaviour?
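The figures quoted above can be sanity-checked with some quick arithmetic. The following is only a rough sketch: it assumes decimal units (1TB = 1000GB), treats the "~900GB" disk size as exactly 900GB, and ignores metadata overhead. The function names are made up for illustration, and the numbers say nothing about the cause of the shortfall.

```python
# Back-of-the-envelope check of the capacity figures quoted in the post.
# Assumptions (not from any VSAN tool): decimal units, 900GB per disk,
# no allowance for metadata overhead.

HOSTS = 8
DISK_GROUPS_PER_HOST = 4
DISKS_PER_DISK_GROUP = 5
DISK_SIZE_GB = 900          # the post says "~900GB"
OBSERVED_CAPACITY_TB = 100  # what the customer actually saw (approximate)


def total_disks(hosts, disk_groups_per_host, disks_per_disk_group):
    """Number of capacity disks across the whole cluster."""
    return hosts * disk_groups_per_host * disks_per_disk_group


def raw_capacity_tb(disks, disk_size_gb):
    """Raw capacity in TB, using 1TB = 1000GB."""
    return disks * disk_size_gb / 1000


if __name__ == "__main__":
    disks = total_disks(HOSTS, DISK_GROUPS_PER_HOST, DISKS_PER_DISK_GROUP)
    expected_tb = raw_capacity_tb(disks, DISK_SIZE_GB)
    shortfall_tb = expected_tb - OBSERVED_CAPACITY_TB
    print(f"Disks claimed:     {disks}")
    print(f"Expected raw (TB): {expected_tb}")
    print(f"Observed (TB):     {OBSERVED_CAPACITY_TB}")
    print(f"Shortfall (TB):    {shortfall_tb}")
```

With these assumptions, the disk count matches the 160 disks shown as claimed in the web client, and the expected raw capacity lands close to the "region of 150TB" figure, so the gap to the observed ~100TB is far larger than metadata overhead alone would explain.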
As part of a quick reference proof-of-concept/evaluation guide that I have been working on, it has become very clear that one of the areas that causes the most confusion is what happens when a storage device is either manually removed from a host participating in the Virtual SAN cluster or the device suffers a failure. These are not the same thing from a Virtual SAN perspective.
To explain the different behaviour, it is important to understand that Virtual SAN has two types of failure state for components: ABSENT and DEGRADED.
There was a very interesting discussion on our internal forums here at VMware over the past week. One of our guys had built out a VSAN cluster, and everything looked good. However, on attempting to deploy a virtual machine on the VSAN datastore, he kept hitting an error which reported that it “cannot complete file creation operation”. As I said, everything looked healthy. The cluster formed correctly, there were no network partitions and the network status was normal. So what could be the problem?
Over the past month or so, I’ve been looking at disaster recovery of some of the vCloud Suite components. My experiences of using vSphere Replication and Site Recovery Manager to protect and recover vCenter Operations Manager in the event of a disaster can be found here and here. Now it was time to look at vCenter Orchestrator (vCO) to see if that could be protected and recovered.
In this configuration, I deployed vCO in HA mode, meaning that there were two vCenter Orchestrator servers, one running and one in standby mode. The database for vCO was an external SQL Server database, running in its own VM. So there were three VMs to protect in this setup.
I have been doing a bunch of stuff around disaster recovery (DR) recently, and my storage of choice at both the production site and the recovery site has been VSAN, VMware Virtual SAN. I have already done a number of tests with products like vCenter Server, vCenter Operations Manager and NSX, our network virtualization product. Next up was vCO, our vCenter Orchestrator product. I set up vSphere Replication for my vCO servers (I deployed them in an HA configuration) and their associated SQL DB VM on Friday, but when I got in on Monday morning, I could not log on to my vCenter. The problem was that my vCenter was running on VSAN (a bit of a chicken and egg type situation), so how do I troubleshoot this situation without my vCenter? And what was the actual problem? Was it a VSAN issue? This is what had to be done to resolve it.
I’ve been working on some Disaster Recovery (DR) scenarios recently with my good pal Paudie. Last month we looked at how we might be able to protect vCenter Operations Manager, by using a vApp construct and also using IP customization. After VMworld, we turned our attention to NSX, and how we might be able to implement a DR solution for NSX. This is still a work in progress, but we did learn some very useful NSX troubleshooting commands that I thought would be worth sharing with you.