NFS Best Practices – Part 1: Networking
There is a project currently underway here at VMware to update the current Best Practices for running VMware vSphere on Network Attached Storage. The current paper is a number of years old now, and we are looking to bring it up to date. There are a number of different sections that need to be covered, but we decided to start with networking, as getting your networking infrastructure correct plays a crucial part in your NAS performance and availability.
We are also looking for feedback on what you perceive as a best practice. The thing about best practices is that something which might be correct for one customer may not be the correct thing for another customer. I hope you will continue reading to the end, and provide some feedback on how you implement your NFS network.
Since VMware still only supports NFS version 3 over TCP/IP, there are still some limits to the multipathing and load-balancing approaches that we can take. So although there are two connections when an NFS datastore is mounted to an ESXi host (one connection for control, the other connection for data), there is still only a single TCP session for I/O. From the current VMware Best Practices for NFS:
It is also important to understand that there is only one active pipe for the connection between the ESX server and a single storage target (LUN or mountpoint). This means that although there may be alternate connections available for failover, the bandwidth for a single datastore and the underlying storage is limited to what a single connection can provide. To leverage more available bandwidth, an ESX server has multiple connections from server to storage targets. One would need to configure multiple datastores with each datastore using separate connections between the server and the storage. This is where one often runs into the distinction between load balancing and load sharing. The configuration of traffic spread across two or more datastores configured on separate connections between the ESX server and the storage array is load sharing.
Throughput Options
Let’s begin the conversation by looking at some options available to you and how you might be able to improve performance, keeping in mind that you have a single connection between host and storage.
1. 10GigE. A fairly obvious one to begin with. If you can provide a larger pipe, the likelihood is that you will achieve greater throughput. Of course, if you’re not driving enough I/O to fill a 1GigE pipe, then a fatter pipe isn’t going to help you. But let’s assume that you have enough VMs and enough datastores for 10GigE to be beneficial.
2. Jumbo Frames. While this feature can deliver additional throughput by increasing the size of the payload in each frame from a default MTU of 1500 to an MTU of 9,000, great care and consideration must be used if you decide to implement it. All devices sitting in the I/O path must be able to implement jumbo frames for it to make sense (array controller, physical switches, NICs and VMkernel ports).
3. Load Sharing. One of the configuration options mentioned above and taken from the current white paper is to use multiple connections from the ESXi server to the storage targets. To implement this, one would need to configure multiple datastores, with each datastore using separate connections between the server and the storage, i.e. NFS shares presented on different IP addresses, as shown in the following diagram:
4. Link Aggregation. Another possible way to increase throughput is via the use of link aggregation. This isn’t always guaranteed to deliver additional performance, but I will discuss Link Aggregation in the context of availability later on in the post.
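To put a rough number on the jumbo frames option above, the following sketch compares how much TCP payload each Ethernet frame can carry at an MTU of 1500 versus 9000. It assumes plain IPv4 and TCP headers with no options (20 bytes each) and standard Ethernet framing overhead; the function names are illustrative only, and real traffic with TCP options or VLAN tags will differ slightly.

```python
# Assumed per-frame overheads in bytes (IPv4 + TCP without options).
ETH_OVERHEAD = 14 + 4 + 8 + 12  # header, FCS, preamble/SFD, inter-frame gap
IP_HEADER = 20
TCP_HEADER = 20


def tcp_payload(mtu: int) -> int:
    """TCP payload bytes that fit in one frame at the given MTU."""
    return mtu - IP_HEADER - TCP_HEADER


def wire_efficiency(mtu: int) -> float:
    """Fraction of the bytes on the wire that are TCP payload."""
    return tcp_payload(mtu) / (mtu + ETH_OVERHEAD)


def frames_needed(nbytes: int, mtu: int) -> int:
    """Frames required to carry nbytes of payload (ceiling division)."""
    return -(-nbytes // tcp_payload(mtu))


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(
            f"MTU {mtu}: payload {tcp_payload(mtu)} B/frame, "
            f"efficiency {wire_efficiency(mtu):.3f}, "
            f"frames for 64 KiB: {frames_needed(64 * 1024, mtu)}"
        )
```

The efficiency gain on the wire is only a few percent; the larger saving is usually the reduced number of frames that hosts, switches and the array have to process for the same amount of data.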
Minimizing Latency
Since NFS on VMware uses TCP/IP to transfer I/O, latency can be a concern. To minimize latency, one should always try to minimize the number of hops between the storage and the ESXi host.
Ideally, one would not route between the ESXi host and the storage array either, and have them both on the same subnet. In fact, prior to ESXi 5.0U1, one could not route between the ESXi host and the storage array, but we lifted some of the restrictions around this in 5.0U1 as per this blog post. It is still pretty restrictive however.
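As a small illustration of the "same subnet" recommendation, the sketch below checks whether an NFS server address falls inside the network of a VMkernel port. The addresses and the function name are made up for the example; it uses only Python's standard `ipaddress` module and does not query an ESXi host.

```python
import ipaddress


def same_subnet(vmk_cidr: str, nfs_server_ip: str) -> bool:
    """Return True if the NFS server sits in the VMkernel port's subnet.

    vmk_cidr is the VMkernel port address with prefix length,
    e.g. "192.168.10.11/24" (a hypothetical value).
    """
    vmk = ipaddress.ip_interface(vmk_cidr)
    return ipaddress.ip_address(nfs_server_ip) in vmk.network


if __name__ == "__main__":
    # Hypothetical addresses: the first target is local, the second needs routing.
    print(same_subnet("192.168.10.11/24", "192.168.10.50"))
    print(same_subnet("192.168.10.11/24", "192.168.20.50"))
```

A result of False means the traffic would have to be routed, which adds at least one hop between the host and the array.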
Security
All NAS array vendors agree that it is good practice to isolate NFS traffic for security reasons. By default, NFS traffic is sent in clear text over the network. Therefore, it is considered best practice to use NFS storage on trusted networks only. This would mean isolating the NFS traffic on its own separate physical switches or leveraging a dedicated VLAN (IEEE 802.1Q).
Another security concern is that the ESXi host mounts the NFS datastores using root privileges. Since this is NFS version 3, none of the security features implemented in later versions of NFS are available. To address the concern, again it is considered a best practice to use either a dedicated LAN or a VLAN for protection and isolation.
Availability
There are a number of options which can be utilized to make your NFS datastores highly available.
1) NIC teaming at the host level. IP hash failover enabled at the ESXi host. A common practice is to set the NIC Teaming failback Option to no. The reason for this is to avoid a flapping NIC if there is some intermittent issue on the network.
The above design is somewhat simplified. There are still issues with the physical LAN switch being a single point of failure (SPOF). To avoid this, a common design is to use NIC teaming in a configuration which has two physical switches. With this configuration, there are four NICs in the ESXi host, and these are configured in two pairs of two NICs with IP hash failover. Each pair is configured as a team at its respective LAN switch.
2) Link Aggregation Control Protocol (LACP) at the array level is another option one could consider. Link Aggregation enables you to combine multiple physical interfaces into a single logical interface. Now it is debatable whether this can improve throughput/performance since we are still limited to a single connection with NFS version 3, but what it does allow is protection against path failures. Many NFS array vendors support this feature at the storage controller port level. Most storage vendors will support some form of link aggregation, although not all configurations may conform to the generally accepted IEEE 802.3ad standard. Best to check with your storage vendor. One of the features of LACP is its ability to respond to events on the network and decide which ports should be part of the logical interface. Many failover algorithms only respond to link down events. This means that a switch could inform the array that an alternate path needs to be chosen, rather than the array relying on a port failure.
3) LACP at the ESXi host level. This is a new feature which VMware introduced in vSphere 5.1. While I haven’t tried this myself, I can see this feature providing some additional availability in the same way as it may provide additional availability for arrays, so that a failover to an alternate NIC can now occur based on feedback from the physical switch as opposed to just relying on a link failure event (if this is indeed included in the VMware implementation of LACP). You can learn a little bit more about host side LACP support in a whitepaper written by my colleague Venky. If you are using this new feature, and it has helped with availability improvements, I’d really like to know.
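The IP hash teaming mentioned in option 1 above is commonly described as an XOR of the source and destination IP addresses, taken modulo the number of uplinks in the team. The sketch below follows that description for IPv4 only. It is an approximation for reasoning about traffic distribution, not the vSwitch implementation, and the addresses and function name are hypothetical.

```python
import ipaddress


def ip_hash_uplink(src_ip: str, dst_ip: str, uplink_count: int) -> int:
    """Approximate uplink index chosen for a source/destination IPv4 pair.

    Assumption: hash = (source XOR destination) mod number of uplinks.
    """
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % uplink_count


if __name__ == "__main__":
    vmk = "10.0.0.11"  # hypothetical VMkernel port address
    for target in ("10.0.0.50", "10.0.0.51"):  # hypothetical array addresses
        print(target, "-> uplink", ip_hash_uplink(vmk, target, 2))
```

Under this model a given VMkernel address and array address always map to the same uplink, which is why a single datastore does not gain bandwidth from the team. Presenting NFS shares on different array IP addresses gives the hash a chance to place them on different uplinks.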
Miscellaneous Network Features
By way of completeness, I wanted to highlight a few other recommendations from our storage partners. The first of these is flow control. Flow control manages the rate of data flow between the ESXi host and storage array. Depending on the interconnect (1GigE or 10GigE), some array vendors make recommendations about turning flow control off and allowing congestion to be managed higher up the stack. One should always refer to the storage array vendor's best practices for guidelines.
The second is a recommendation around switch ports when Spanning Tree Protocol (STP) is used in an environment. STP is responsible for ensuring that there are no network loops in a bridged network by disabling network links and ensuring that there is only a single active path between any two network nodes. If there are loops, this can have severe performance impacts on your network with unnecessary forwarding of packets taking place, eventually leading to a saturated network. Some storage array vendors recommend setting the switch ports to which their array ports connect as either RSTP edge ports or Cisco portfast. This means that the ports transition immediately to the forwarding state. Refer to your storage array best practices for advice on this setting, and whether it is appropriate for your storage array.
Useful VMware KBs for NFS networking
Useful NFS Best Practice References
- NetApp’s NFS Best Practices TR-3749
- EMC Isilon’s vSphere 5 Reference Architecture
- HDS vSphere 5 Reference Architecture
- Chad Sakac (virtual geek) multi-vendor NFS blog post – a recommended read
Feedback Request
So, the question is, do you do anything different to the above? This is just from a network configuration perspective. Later, I will do some posts on tuning advanced settings, and integration with vSphere features like Storage I/O Control, Storage DRS, Network I/O Control and VAAI. But for now I’m just interested in networking best practices. Please leave a comment.
Get notification of these blog postings and more VMware Storage information by following me on Twitter: @CormacJHogan
I believe that any document on storage best practices should cover the vast number of options, including Storage DRS, VAAI, VASA, etc. There are a LOT of complex interactions between what ESXi supports and what the vendors support. In a lot of cases, these best practices contradict each other and in many other cases, the recommendations are too vague to implement (flow control anyone?). In other cases, VMware will push a feature only to have it conflict with other features it supports (e.g. SDRS not being compatible with thin-provisioned VMDKs). Clear and current best practices would be VERY welcome.
Thanks for the comment Ed. Indeed, that is the end-goal of these best practices. That is why we are reaching out now via the blog post to figure out what people need from the paper.
As 10Gb nics become more prevalent, would like to see the newer option of Load Based Teaming (w/ NIOC) written about, as opposed to just link aggregation with IP hash.
We will be looking at that when we do the vSphere integration part of the best practices.
Cormac,
Any word when the new paper might be coming out? I am using NFS on 10GbE with LBT and would like to get VMware’s opinion on it.
Should be any day now Phil.
Hi Cormac – a couple of other random comments…
Is it still recommended to mount NFS exports via IP address as opposed to hostname? I still mount via IP address, but I’m asked this question every now and then.
Also, as a quick side note, tried to turn off flow control on Intel (ixgbe) 10Gb nics as recommended by storage vendor and ran into ESX/intel 5.x driver bug that prevented me from changing it 🙁 Supposed to be resolved in future ESXi patch I believe.
Hey Stacy,
My understanding is that we can use either FQDN or IP addresses for mounting. The one consideration is that each ESXi host sharing the datastore must use the same IP address or FQDN when mounting.
Thanks for the heads-up on the flow control bug. I’ll look into that.
Cormac
I see conflicting information here. In some places you’ll see that you should mount via IP and in others you’ll find functionality missing if you do (e.g. round-robin load balancing via FQDN). The last I heard was that VMware doesn’t care. I ran into conflicting messages from NetApp – for example, until the September release of their SRA for SRM, mounting a datastore via FQDN was not supported because they still consider it a best practice to mount via IP. It’s one of the many finger-pointing items we see between VMware and NetApp.
We will endeavour to get a best practice statement around this for you in the final whitepaper.
Hi guys
Peter from NetApp here. There are a few reasons we don’t recommend DNS (round robin or not) or host files for NFS mounts.
First, and this applies to situations where a different IP to the storage is resolved on each host, there was a behaviour, fixed in vCenter 5.0, whereby vCenter would look under the host or FQDN and inspect the actual IP address used to mount the datastore. If it discovered different IPs to the same datastore on 2 or more hosts, it would rename any mismatches after the first to something like nfsdatastore(1). You can see this same behaviour, although it is correct behaviour, for every host you add to VC after the first – the internal datastore on each ESX server gets renamed. On shared datastores, this behaviour seriously broke vMotion, HA, DRS, etc. I explain this in more detail in my blog at https://communities.netapp.com/community/netapp-blogs/getvirtical/blog/2011/09/28/nfs-datastore-uuids-how-they-work-and-what-changed-in-vsphere-5.
Like I said, this vCenter behaviour is fixed in vCenter (and vCSA) 5.0 and later, but some of us wonder what other little gotchas are lurking.
There are NetApp reasons as well, as you point out, including SRA handling.
Another issue is that you now have infrastructure dependent on other infrastructure. If your DNS goes down or is inaccessible, you can’t mount datastores. Your favourite deity help you if your DNS service is virtualized and got SVMotioned or otherwise placed on the datastore you’re trying to mount.
The downside of using IPs is it makes it trickier to change the network architecture without having to remount datastores and have all your VMs go inaccessible. DNS can help you in some cases, but it depends on what you’re trying to do.
Still, as I and others have stated many times, a best practice is not the only way to do it – simply the best way to do it in most situations.
I hope this helps!
Peter
Thanks for taking the time to add this clarification Peter. Much appreciated.
– Maybe also give more attention to Ip Hash option (->cfr Frank Denneman’s article : http://frankdenneman.nl/networking/nfs-and-ip-hash-loadbalancing/)
– and maybe some other considerations when using HP flex-10 or Cisco’s solution.
– some other random thoughts : recommended NFS datastore sizes/aggregates,…,how SDRS deals with this, etc…
Thanks Erik – I’ll definitely consider these recommendations for the final white paper.
Erik, you’re absolutely right. It seems we can’t get anybody to agree on datastore sizes, how many datastores per aggregate, etc. SDRS of course has its own set of issues with thin provisioning or de-duped storage. It’s a pretty complex set of intersecting puzzle pieces. Throw pieces like asis and snapmirror into the picture and it just gets uglier because you’ve got restrictions on how large of a volume you can work with based on the memory and OnTap release of the filer. I would personally like to see a joint VMware/NetApp best practice document but we can’t seem to get the two sides to agree.
The next part of the best practices is to discuss the interoperability issues. Storage DRS will feature prominently.
Hello Cormac,
I am very interested to find out how LACP would help us more than it could in pre-5.1 versions. I see a lot of fuss about it but we are still working with the same restrictions if I am not mistaken: one physical switch upstream and IP hash from the VMware side. I have to admit that the KB article from VMware puts another layer of mystery on top. Can you give me a hint as to whether this is valid?
Thanks.
We will try our best to make sure all of this is clarified in the white paper.
Hi Cormac,
In all 3 parts of your article I see nothing about aligning disks. I know that NetApp mentions it in their best practice document, but I have heard people dismiss it on NFS since it is file level storage and that aligning storage is only needed on block level storage. Although it is not an NFS best practice on its own, a statement in the VMware best practice document would be nice, as it can be a (silent) performance killer.
That’s a good call Akos – I’ll be sure to include something around that in the whitepaper.
The red pill: storage is now another application traversing a lossy medium. IOPS, latency, and everything in between are key.
One tries to re-implement the equivalent FC-SAN guarantees in terms of performance, reliability/integrity, and security. LACP throughout your infrastructure should be configured on a 5 tuple load balance (which will use the ephemeral SRC port on storage VLANs) to use all available paths rather than filling buffers and reusing the same link in a bundle each time. Ethernet frames should definitely be Jumbo at 9000 bytes on *endpoints* and 9216 bytes on all intermediaries. You then have layer 2 storage VLANs for NFS using 9014 byte frames and an 8948 byte TCP packet payload (e.g. 1 jumbo vs. 6 standard packets, TCP 8948 vs. 1452 bytes thus saving many overheads on both host and guest storage endpoints). Also LLDP’s DCBX is becoming more important but for the moment the focus should be on the CoS value of storage frames including the prioritisation thereof and no-drop characteristics on both in-path and endpoint nodes. FQDN should not be used as now your storage is tightly coupled to DNS UDP packets and other administrative failure domains (unless you use host files). Definitely keep NFS on an isolated and localised VLAN. Do not route, expose, or share across data centres. Separate out storage tenancies, security zones, or tiered workloads into defensive, instrumented, fully supportable failure domains and minimise common mode failures.
Also the TCP window size (+NFS R/W sizes) and TCP scaling options are important (scaling not supported by VMware yet?). If you do some throughput and BDP(Bandwidth Delay Product) calculations things get really interesting especially for 10Gbps. Recommendation is to use vscsiStats, Wireshark, and vmkping from NFS client to storage with the DF bit set + size of 8972 (allowing for ICMP/IP overheads) to truly understand what’s going on…
We’re doing some testing over @nodecity at the moment, watch this space… and great article Cormac! Maith an fear!
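The ping size and Bandwidth Delay Product figures raised in the comment above can be worked through with a short sketch. The header sizes are the standard IPv4 (20 bytes) and ICMP (8 bytes) values; the 500 microsecond round-trip time and the 65535 byte unscaled TCP window are assumptions chosen for illustration, not measurements, and the function names are made up.

```python
IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def bdp_bytes(bandwidth_bps: int, rtt_us: int) -> int:
    """Bandwidth Delay Product: bytes in flight needed to fill the link."""
    return bandwidth_bps * rtt_us // 8_000_000


def window_limited_bps(window_bytes: int, rtt_us: int) -> int:
    """Throughput ceiling for one TCP session with a fixed window."""
    return window_bytes * 8 * 1_000_000 // rtt_us


if __name__ == "__main__":
    print("ping payload at MTU 9000:", max_ping_payload(9000))
    # Assumed values: 10 Gbps link, 500 microsecond round-trip time.
    print("BDP in bytes:", bdp_bytes(10_000_000_000, 500))
    print("ceiling in bps with a 65535 byte window:", window_limited_bps(65535, 500))
```

The first function gives the 8972 byte size quoted for vmkping with the DF bit set (its -d and -s options). With the assumed round-trip time, a single session limited to an unscaled window would reach only about a tenth of a 10 Gbps link, which is why window scaling matters for the single TCP session per datastore.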