VSAN Part 29 – Cannot complete file creation operation
There was a very interesting discussion on our internal forums here at VMware over the past week. One of our guys had built out a VSAN cluster, and everything looked good. However, on attempting to deploy a virtual machine on the VSAN datastore, he kept hitting an error which reported that it “cannot complete file creation operation”. As I said, everything looked healthy. The cluster had formed correctly, there were no network partitions, and the network status was normal. So what could be the problem?
This is the error that popped up when the VM was being provisioned on the VSAN datastore:
Note also the “Failed to connect to component host” message. This might give you a clue as to the root cause. This had many of us scratching our heads until one of our engineers asked a question about MTU settings on the VSAN network. The MTU (maximum transmission unit) is the largest packet or frame size that can be sent over the network. In this case, an MTU of 9000 (jumbo frames) was configured on the switch. However, in this setup, it seems that an MTU of 9000 on the switch (Dell PowerConnect) wasn’t large enough to carry the frames produced by the MTU of 9000 configured on the ESXi hosts. The switch actually required an MTU of 9216 (9 * 1024) to allow successful communication using jumbo frames on the VSAN network. Once this change was made on the switch, virtual machines could be successfully provisioned on the VSAN datastore.
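One common way to check whether jumbo frames really pass end to end is a ping with the "don't fragment" bit set and a payload sized to fill the MTU exactly. The sketch below only does the arithmetic and prints a vmkping command line to run on an ESXi host; it assumes IPv4 without options, and the interface name vmk2 and the address 192.0.2.11 are placeholders, not values from the setup described above.

```python
IP_HEADER = 20    # IPv4 header without options, in bytes
ICMP_HEADER = 8   # ICMP echo header, in bytes


def df_ping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet of `mtu` bytes."""
    return mtu - IP_HEADER - ICMP_HEADER


def vmkping_command(mtu: int, interface: str, target: str) -> str:
    """Build a vmkping command line: -d sets don't-fragment, -s sets the payload size."""
    return f"vmkping -I {interface} -d -s {df_ping_payload(mtu)} {target}"


if __name__ == "__main__":
    # Placeholder interface and address; substitute the VSAN VMkernel port
    # and the VSAN IP of another host in the cluster.
    print(vmkping_command(9000, "vmk2", "192.0.2.11"))
```

If a ping of that size fails between hosts while a default-sized ping succeeds, something in the path is not passing the full 9000-byte packets.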
So why didn’t VSAN report this as an issue? Currently, VSAN doesn’t use the larger jumbo frames to check that the cluster is correctly formed. Now that we know about this behaviour, we will be looking at addressing it going forward.
Note that the VSAN network also requires multicast, but a multicast problem would have been obvious much earlier, since the cluster will not form without that functionality.
Back in the days when I was doing a lot with Dell EQL, I ran into the same jumbo frame issue with Dell PowerConnect switches. It is so unfortunate that we have different vendors doing “marketing 9k frames” and “real 9k frames”.
It is good practice to have a smaller MTU on your servers than on the Switches. This is because switches (including the vSwitch) need to tack on VLAN tags and sometimes other transit tags on the packets.
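The headroom argument can be made concrete with a little arithmetic. This is a rough sketch that assumes a standard Ethernet header, an optional single 802.1Q VLAN tag, and a frame check sequence, and that the switch limit applies to the whole frame. Whether a given switch counts its configured MTU this way varies by vendor, so treat that as an assumption rather than a statement about any particular model.

```python
ETHERNET_HEADER = 14  # destination MAC + source MAC + EtherType, in bytes
VLAN_TAG = 4          # one 802.1Q tag, in bytes
FCS = 4               # frame check sequence, in bytes


def frame_size(host_mtu: int, vlan_tagged: bool = True) -> int:
    """Size of the full Ethernet frame carrying a packet of `host_mtu` bytes."""
    size = host_mtu + ETHERNET_HEADER + FCS
    if vlan_tagged:
        size += VLAN_TAG
    return size


def fits(host_mtu: int, switch_frame_limit: int, vlan_tagged: bool = True) -> bool:
    """True if the frame fits, assuming the switch limit covers the whole frame."""
    return frame_size(host_mtu, vlan_tagged) <= switch_frame_limit


if __name__ == "__main__":
    for limit in (9000, 9216):
        print(limit, frame_size(9000), fits(9000, limit))
```

Under these assumptions a 9000-byte packet becomes a 9022-byte tagged frame, which does not fit a 9000-byte frame limit but fits comfortably under 9216.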
Solution would be for the VMkernel to do Path MTU Discovery (PMTUD) to avoid issues like this.
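To illustrate the idea of discovering the usable size instead of assuming it, here is a minimal sketch that binary-searches for the largest packet size a path accepts. It is not real PMTUD, which relies on ICMP "fragmentation needed" replies; it simply probes sizes. The probe callback is hypothetical (in practice it might wrap a don't-fragment ping), and the search assumes that if one size passes, every smaller size passes too.

```python
from typing import Callable


def find_path_mtu(probe: Callable[[int], bool], low: int = 576, high: int = 9000) -> int:
    """Return the largest size in [low, high] for which probe(size) is True.

    Returns 0 if even the smallest size fails. Assumes probe is monotonic:
    if a size passes, every smaller size passes as well.
    """
    if not probe(low):
        return 0
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid                # mid works, look for something larger
        else:
            high = mid - 1           # mid fails, the answer is smaller
    return low


if __name__ == "__main__":
    # Simulated path that silently drops packets larger than 8978 bytes.
    simulated_limit = 8978
    print(find_path_mtu(lambda size: size <= simulated_limit))
```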
The VMkernel doesn’t fragment and doesn’t have PMTUD, which causes many problems in different scenarios, including this one. The problems are bigger in NSX environments due to the lack of fragmentation ability in the VMkernel. Hopefully we get a better network stack in the next major release. Most switches (Cisco, Dell, etc.) support an MTU of 9216 or higher, so that is what is normally configured on the physical switch side for VMkernel ports that need jumbo frames.
I am getting this error deploying an OVF from a UNC path, but my Cisco switch has mtu 9170 set. I wonder if it’s because I am somehow crossing from jumbo to non-jumbo. The VSAN network is jumbo but nothing else is. The server housing the OVF is on the VSAN. My workstation is (obviously) outside that network. I’m going to copy the OVF elsewhere and see what happens.