Recovering from a full VSAN datastore scenario
We had an interesting event happen on one of our lab servers this weekend. One of the hosts in our four-node cluster hit an issue, which meant that the storage on that host was no longer available to the VSAN datastore. Since VSAN auto-heals, it attempted to re-protect as many VMs as possible. However, since we had chosen to ignore one of the health check warnings about limits, we ended up with a full VSAN datastore.
This is the limits warning I am talking about. Note that it is reporting that there is not enough space to handle a node failure – we would require 120% of available storage to re-protect all of our VMs. Note to self: “don’t ignore this warning in future”:
This is our management cluster. It held our vCenter server, our DNS/AD server, and a bunch of other applications such as VDP and View. Anyway, to cut a long story short, we fixed the original problem, and the space issue got resolved. Now we were left in a situation that required us to restart VMs. Of course, when the VMs ran out of disk space, they got suspended. This is standard behaviour for VMs on all datastores, not just VSAN. We could tell that they were in this state by examining the vmware.log file of VMs such as the vCenter server and DNS/AD server:
2016-04-17T09:50:05.280Z| vmx| I120: Msg_Question:
2016-04-17T09:50:05.280Z| vmx| I120: [msg.hbacommon.outofspace] There is no \
more space for virtual disk mgmt-vc01.rainpole.com_6.vmd
2016-04-17T09:50:05.280Z| vmx| I120: ----------------------------------------
2016-04-17T09:50:25.278Z| vcpu-0| I120: Tools: Tools heartbeat timeout.
2016-04-17T09:54:05.650Z| vmx| I120: Timing out dialog 1510834
2016-04-17T09:54:05.651Z| vmx| I120: Vigor_MessageRevoke: message \
'msg.hbacommon.outofspace' (seq 1510834) is revoked
2016-04-17T09:54:05.651Z| vmx| I120: MsgQuestion: msg.hbacommon.outofspace reply=0
2016-04-17T09:54:05.687Z| vmx| I120: Msg_Question:
2016-04-17T09:54:05.687Z| vmx| I120: [msg.hbacommon.outofspace] There is no \
more space for virtual disk mgmt-vc01.rainpole.com_6.vmd
2016-04-17T09:54:05.687Z| vmx| I120: ----------------------------------------
2016-04-17T09:58:06.664Z| vmx| I120: Timing out dialog 1510835
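As a quick check from the ESXi shell, the out-of-space questions in a log like the one above can be counted with grep. This is only a sketch: the function name outofspace_count is made up for illustration, and the path to each VM's vmware.log on your datastore is something you would need to fill in yourself.

```shell
# outofspace_count LOGFILE
# Count the lines in a vmware.log that carry the out-of-space question tag.
# -F treats the pattern as a fixed string, so the brackets and dots are literal.
outofspace_count() {
    grep -cF '[msg.hbacommon.outofspace]' "$1"
}
```

A count greater than zero suggests the VM has been stalled on this question at least once.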
Now, since we no longer had access to our vCenter server, how could we respond to these messages? We decided to try to do this via the CLI using vim-cmd. This is how we did it, starting with the AD/DNS server, dc01.rainpole.com. First, we needed the VM id. The following command provided us with that:
[root@esxi-hp-01:~] vim-cmd /vmsvc/getallvms
Vmid Name
15 vVNX [vsanDatastore] ...
16 vdp-01.rainpole.com [vsanDatastore] ...
20 dc01.rainpole.com [vsanDatastore] ...
22 appvolmgr.rainpole.com [vsanDatastore] ...
24 HCIBench [vsanDatastore] ...
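On a host with many VMs, the id can be picked out of that listing by name. The helper below is a rough sketch, not something we ran at the time: vmid_by_name is a hypothetical name, and it assumes the VM name contains no spaces, since it simply compares the second whitespace-separated column.

```shell
# vmid_by_name NAME
# Read `vim-cmd /vmsvc/getallvms` output on stdin and print the Vmid
# of the VM whose name (second column) matches NAME exactly.
# NR > 1 skips the "Vmid Name ..." header line.
vmid_by_name() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# Usage on the host:
#   vim-cmd /vmsvc/getallvms | vmid_by_name dc01.rainpole.com
```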
Next, we displayed the message on the VM:
[root@esxi-hp-01:~] vim-cmd /vmsvc/message 20
Virtual machine message 6034178:
There is no more space for virtual disk \
dc01-rainpole.com.vmdk. You might be able to continue this \
session by freeing disk space on the relevant volume, \
and clicking Retry. Click Cancel to terminate this session.
0. button.retry (Retry) [default]
1. button.abort (Cancel)
The next step is to respond to the message. Note that the message id is 6034178.
[root@esxi-hp-01:~] vim-cmd /vmsvc/message 20 6034178 0
[root@esxi-hp-01:~]
Looks good. Let’s see if the message is still there. If it is, as shown below, you will have to run the same command again, but with the new message id. Note that you may have to repeat it a number of times.
[root@esxi-hp-01:~] vim-cmd /vmsvc/message 20
Virtual machine message 6034179:
There is no more space for virtual disk \
dc01-rainpole.com.vmdk. You might be able to continue this \
session by freeing disk space on the relevant volume, and \
clicking Retry. Click Cancel to terminate this session.
0. button.retry (Retry) [default]
1. button.abort (Cancel)
[root@esxi-hp-01:~] vim-cmd /vmsvc/message 20 6034179 0
[root@esxi-hp-01:~] vim-cmd /vmsvc/message 20
Virtual machine message 6034180:
There is no more space for virtual disk \
dc01-rainpole.com.vmdk. You might be able to continue this \
session by freeing disk space on the relevant volume, and \
clicking Retry. Click Cancel to terminate this session.
0. button.retry (Retry) [default]
1. button.abort (Cancel)
[root@esxi-hp-01:~] vim-cmd /vmsvc/message 20 6034180 0
[root@esxi-hp-01:~] vim-cmd /vmsvc/message 20
No message.
[root@esxi-hp-01:~]
So finally the message has cleared. The next step is to bring up vCenter using the same method. Once your vCenter server is up and running, you can connect to it with the client, and answer any other outstanding questions associated with your VMs.
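Since the question can come back several times with a new message id, the query-and-answer steps above could be wrapped in a small loop. What follows is a minimal sketch under a few assumptions: the names answer_retry, pending_message_id, VIMCMD and RETRY_DELAY are all made up for illustration, it always answers 0 (which was Retry in the output above), and it only makes sense once space has actually been freed on the datastore.

```shell
# Command used to talk to the host; overridable for a dry run (assumption).
VIMCMD="${VIMCMD:-vim-cmd}"

# Read `vim-cmd /vmsvc/message <vmid>` output on stdin and print the id
# from a line such as "Virtual machine message 6034178:".
# Prints nothing when the output is "No message.".
pending_message_id() {
    sed -n 's/^Virtual machine message \([0-9][0-9]*\):.*/\1/p' | head -n 1
}

# answer_retry VMID [MAX_TRIES]
# Keep answering 0 (Retry) until no question is pending or MAX_TRIES is hit.
answer_retry() {
    vmid="$1"
    max_tries="${2:-10}"
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        msgid=$($VIMCMD /vmsvc/message "$vmid" | pending_message_id)
        if [ -z "$msgid" ]; then
            echo "VM $vmid: no pending message"
            return 0
        fi
        echo "VM $vmid: answering message $msgid with 0 (Retry)"
        $VIMCMD /vmsvc/message "$vmid" "$msgid" 0
        tries=$((tries + 1))
        # Give the VM a moment to raise the question again, if it is going to.
        sleep "${RETRY_DELAY:-2}"
    done
    echo "VM $vmid: still asking after $max_tries attempts" >&2
    return 1
}

# Usage on the host, for the AD/DNS server above:
#   answer_retry 20
```

The attempt cap is there so that a datastore which is still full does not leave the loop answering Retry forever.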
Hi Cormac. Nice write up as always.
Would it be possible to clear messages and start VMs with ESXi Embedded Host Client? I hope so.
When you said, “Anyway, to cut a long story short, we fixed the original problem, and the space issue got resolved.” does this mean you added more storage for now? I think I had something similar happen, but after deleting all my VMs (lab for now) I can’t free up any space. The datastore is empty but it still says used is 4.4TB out of 5TB.
No, we fixed the issue on the problematic host, and its storage became available for use once more. I remember a refresh issue with the VSAN datastore views. Can you check it out in the datastores view, and refresh it there, to see if that triggers the UI to update?