Heads Up! NetApp NFS Disconnects
I just received notification of KB article 2016122, which VMware has just published. It deals with a topic I’ve seen discussed recently on the community forums. The symptom is that during periods of high I/O, NFS datastores from NetApp arrays become unavailable for a short period of time before becoming available once again. This seems to be observed primarily when the NFS datastores are presented to ESXi 5.x hosts.
The KB article describes a workaround for the issue: tuning the NFS queue depth on the ESXi hosts to reduce I/O congestion to the datastore. By default, NFS.MaxQueueDepth is set to 4294967295 (which basically means unlimited). The workaround is to change this value to 64, which has been shown to prevent the disconnects. A permanent solution is still being investigated.
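For readers who want to see what that looks like in practice, here is a minimal sketch of checking and applying the setting from the ESXi shell on an ESXi 5.x host; the KB article remains the authoritative procedure (it also covers making the change from the vSphere Client):

  # Check the current value of the NFS.MaxQueueDepth advanced setting
  esxcli system settings advanced list -o "/NFS/MaxQueueDepth"

  # Apply the workaround value of 64 (consult the KB on whether a host reboot is needed for it to take effect)
  esxcli system settings advanced set -o "/NFS/MaxQueueDepth" -i 64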
I recommend all NetApp customers read this KB article, whether you have been impacted or not.
Another interesting and pertinent article. We had the same symptoms under ESX 4.x as well – where the filer just seemed to go away and there were “Stop Accessing / Start Accessing” messages in the vmkernel log.
We found the cause of the high IO load (Symantec AV updating on lots of guests at once) but it’s useful to have this setting as another tool in the box should we see the problem re-occur.
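If you are trying to establish whether you have been hit by the same thing, the connectivity messages mentioned above can be searched for directly on the host; a minimal sketch, assuming ESXi 5.x default log locations (the exact message strings vary between releases):

  # Look for NFS datastore disconnect/reconnect messages in the vmkernel log
  grep -i -e "lost connection to the server" -e "restored connection to the server" /var/log/vmkernel.log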
Hi All, Dimitris from NetApp here.
Just to clear something up:
This is not an issue with NetApp systems specifically. It has to do with the NFS client, and the same behavior can be seen with other systems.
Thx
D
Hi Dimitris,
The KB article explicitly calls out NetApp.
I’m not so sure we have seen this on other systems, but let me check internally.
Cormac
The question is, how come it wasn’t an issue with older VMware revs? We are seeing the problem with RHEL 6.4 as well, but not previous revs.
When we get to the root cause, then we’ll be able to answer that question. For the moment, we just have the workaround.
@dikrek
@cormac
Paudie here from VMW Support
To be crystal clear
We are working very closely with the fine folks in NetApp Support on this problem; both organisations are acutely aware of it and are trying to get to a speedy conclusion.
The purpose of the KB is to offer relief to our customers who have hit this issue while we root-cause the problem.
(Disclosure, I work for NetApp)
Hey Cormac,
I too am hearing internally that this issue also exists with both Celerra and Isilon (perhaps others), but I cannot confirm, obviously. Is there a possibility that a recent change to the ESXi NFS client code is to blame?
Please let us know your findings and update the blog post as appropriate.
-Chris
Yep – I’ll definitely do that Chris.
I like to break stuff. What sort of I/O are we talking about, and at what point does it break? I have a couple of VNXs and a Celerra…
Cormac – Thank you for raising awareness of this issue. For those interested in greater detail, the released fix, and alternative workarounds (including their impact), I have posted the following:
http://virtualstorageguy.com/2013/02/08/heads-up-avoiding-vmware-vsphere-esxi-5-nfs-disconnect-issues/
This is a good article. We had a similar problem recently, and now I’m wondering if it was related to this issue; at the time we thought it was NetFlow configured on the NetApp array causing it. Since then we have disabled NetFlow, reduced the I/O load, and added two more 10Gb uplinks to the LACP connection to the NAS arrays, which resolved the issue.
Just thought I’d share what I’ve learned from this issue.
I work for an independent contractor and have been working with one of our customers who has experienced this issue on a somewhat large scale. In this case, they were seeing these short, random disconnects across all of their 14 sites immediately after upgrading from vSphere 4.1 to 5.0u1 around June of 2012. These sites hosted multiple NFS datastores on mostly IBM-branded FAS20xx systems, and the sites with the higher workloads were seeing a higher rate of disconnects. The disconnects typically lasted from a few seconds to a few minutes and occurred anywhere from once per month to multiple times per day.
Along with the issue manifesting itself in VMware workloads, it is also seen in newer Linux and some UNIX kernels, along with Oracle’s dNFS client (reported in NetApp bug 654196). Per my understanding, it is believed to be due to a recent change whereby the RPC slot table size has been increased, causing a significant increase in the outstanding I/O allowed on the client. This change in traffic ‘pattern’ is triggering the flow-control mechanisms in ONTAP and the subsequent disconnects. It is most prevalent on low-memory systems, as they don’t have the buffer space to absorb the workload as well during high-I/O periods.
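On the Linux side, the knob being described here is the sunrpc slot table limit. A minimal sketch of inspecting and capping it on a RHEL 6.x-era kernel follows; the cap of 128 is purely illustrative and not a vendor recommendation, so check your OS and storage vendor guidance before changing it:

  # Show the current dynamic RPC slot table limit (newer kernels default to 65536)
  cat /proc/sys/sunrpc/tcp_max_slot_table_entries

  # Cap the number of outstanding RPC slots; the option is read when the sunrpc
  # module loads, so it applies to NFS mounts made after the next reboot
  echo "options sunrpc tcp_max_slot_table_entries=128" > /etc/modprobe.d/sunrpc.conf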
We then set out to tweak and tune various aspects of the network and storage controllers because we were informed that absolutely nothing had changed in the NFS client. Per VMware support, “We only see it as a mount point and there’s nothing to change on the host”. Later, in our talks with the VMware TAM and engineering, we were advised (after 8 months of denial) to first turn off RFC3465 TCP congestion control, as it had been added/enabled in the newer release but not documented. This had no effect. Later we were told the default NFS.MaxQueueDepth parameter had also changed in the 5.0 release, again without being documented. VMware engineering stated this was a hidden option/parameter in 4.1 and is now visible in 5.0 via the GUI and/or CLI. The default setting was 64 in 4.1 and 4294967295 in 5.0.
For my customer, changing NFS.MaxQueueDepth back to 64 on each host completely resolved this issue across all sites. This reversion to known-good 4.1 behavior was considered to be the fix and not a workaround, mostly because no one was able to provide a clear reason for the changes in the first place. Their performance was fine before the vSphere upgrade and is fine now that this has been changed back. Perhaps the higher setting would have some performance benefit on 10GbE networks and/or with a greater number of guests per host, but I haven’t heard anything for or against this.
I have also heard unconfirmed reports from colleagues of Isilon and VNX being affected, but haven’t seen anything in writing. Switching to iSCSI also alleviated these issues, but the customer preferred NFS due to its ease of management.
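For anyone in a similar position who needs to make the per-host change across a large number of hosts, the setting lends itself to simple scripting; a minimal sketch, assuming SSH is enabled on the hosts and using placeholder hostnames:

  # Placeholder host list; substitute your own ESXi hosts
  for h in esx01.example.com esx02.example.com; do
    ssh root@$h 'esxcli system settings advanced set -o "/NFS/MaxQueueDepth" -i 64'
  done

The same change can also be made centrally through vCenter (the host advanced settings UI, or PowerCLI’s Get-AdvancedSetting and Set-AdvancedSetting cmdlets) if SSH access is not an option.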
A point of clarification, since I’m familiar with the specifics of what transpired in the issue that @jpl1079 has discussed.
Regarding: “Later we were told the default NFS.MaxQueueDepth parameter had also changed in the 5.0 release, but not documented. VMware engineering stated this was a hidden option/parameter in 4.1 and is now visible in 5.0 via the GUI and/or CLI. The default setting was 64 in 4.1 and 4294967295 in 5.0…..For my customer, changing the NFS.MaxQueueDepth back to 64 on each host completely resolved this issue across all sites. This reversion to known-good 4.1 behavior was considered to be the fix and not a workaround. Mostly due to the fact that no one was able to provide a clear reason for the changes in the first place. Their performance was fine before the vSphere upgrade and is fine now after this was changed back. ”
Response: The NFS queue depth was _introduced_ in 5.0 so that SIOC would work with NFS. VMware (TAM and GSS) provided clear information regarding this during the post-mortem session with VMware and NetApp support engineers. Changing NFS.MaxQueueDepth to 64 is, in fact, a workaround. Vaughn (NetApp) mentions this as well in his article above, stating that “A fix has been released by NetApp engineering and for those unable to upgrade their storage controllers, VMware engineering has published a pair of workarounds” (SIOC or NFS.MaxQueueDepth set to 64).
I would also like to point out that the VMware Technical Account Manager engaged both VMware and NetApp Support _&_ Engineering to drive attention to the issues and a collaborative resolution. The VMware TAM provided advocacy and updates across both organizations and the groups who worked tirelessly to analyze and consult on logs and traces, test and implement resolution paths, and engineer and document workarounds and fixes.
Personally, I wish that the impact had never occurred or that it had not taken so long to reach a resolution for our customers. As discussed post-mortem, we _all_ will strive to do better.
Lastly, my thanks to both the NetApp and VMware engineers who worked on these issues. In particular, I’d like to thank John Ferry at NetApp and Nathan Small and Paudie O’Riordan at VMware, who worked diligently to drive this to resolution.
/rj
@ Ryan; Thanks for the clarification.
I am having the issue myself after upgrading to ESXi 5 on a FAS2020, but changing the depth to 64 and turning on Storage I/O Control did not fix it. Does turning on Storage I/O Control override the depth of 64?
The patch for the NetApp FAS2020 is only supported through special account reps, since it cannot go to 8.0.5.
We’ve started encountering this situation as well, running 5.1 and ONTAP 8.1.1RC1, even though we’ve had Storage I/O Control turned on for months. NetApp bug ID 321428 mentions that this is a potential workaround and that the issue only affects controllers with fewer than 4 logical processors; our 3240 is quad. This directed me to the related bug ID 654196, which doesn’t list an ONTAP fix, but lists changes to the NFS client and, for VMware, tuning NFS.MaxQueueDepth and possibly using SIOC. Of course, VMware KB 2016122 lists those as workarounds and the permanent fix as upgrading ONTAP. Sigh.
We just recently upgraded to ONTAP 8.1.3P1 and still experience this issue (vSphere 5.1 and a NetApp FAS3210 on ONTAP 8.1.3P1).
When using VMFS datastores, I suppose that KB 2016122 doesn’t apply? There is no correlation for the MaxQueueDepth parameter between NFS and VMFS?