Linux and DevOps technical stuff: NFS FAQs

A1. What are the primary differences between NFS Versions 2 and 3?

A. From the system point of view, the primary differences are these:

Version 2 clients can access only the lowest 2GB of a file (signed 32 bit offset). Version 3 clients support larger files (up to 64 bit offsets). Maximum file size depends on the NFS server's local file systems.
NFS Version 2 limits the maximum size of an on-the-wire NFS read or write operation to 8KB (8192 bytes). NFS Version 3 over UDP theoretically supports up to 56KB (the maximum size of a UDP datagram is 64KB, so with room for the NFS, RPC, and UDP headers, the largest on-the-wire NFS read or write size for NFS over UDP is around 60KB). For NFS Version 3 over TCP, the limit depends on the implementation. Most implementations don't support more than 32KB.
NFS Version 3 introduces the concept of Weak Cache Consistency. Weak Cache Consistency helps NFS Version 3 clients more quickly detect changes to files that are modified by other clients. This is done by returning extra attribute information in a server's reply to a read or write operation. A client can use this information to decide whether its data and attribute caches are stale.
Version 2 clients interpret a file's mode bits themselves to determine whether a user has access to a file. Version 3 clients can use a new operation (called ACCESS) to ask the server to decide access rights. This allows a client that doesn't support Access Control Lists to interact correctly with a server that does.
NFS Version 2 requires that a server must save all the data in a write operation to disk before it replies to a client that the write operation has completed. This can be expensive because it breaks write requests into small chunks (8KB or less) that must each be written to disk before the next chunk can be written. Disks work best when they can write large amounts of data all at once.
NFS Version 3 introduces the concept of "safe asynchronous writes." A Version 3 client can specify that the server is allowed to reply before it has saved the requested data to disk, permitting the server to gather small NFS write operations into a single efficient disk write operation. A Version 3 client can also specify that the data must be written to disk before the server replies, just like a Version 2 write. The client specifies the type of write by setting the stable_how field in the arguments of each write operation to UNSTABLE to request a safe asynchronous write, and FILE_SYNC for an NFS Version 2 style write.
Servers indicate whether the requested data is permanently stored by setting a corresponding field in the response to each NFS write operation. A server can respond to an UNSTABLE write request with an UNSTABLE reply or a FILE_SYNC reply, depending on whether or not the requested data resides on permanent storage yet. An NFS protocol-compliant server must respond to a FILE_SYNC request only with a FILE_SYNC reply.
Clients ensure that data that was written using a safe asynchronous write has been written onto permanent storage using a new operation available in Version 3 called a COMMIT. Servers do not send a response to a COMMIT operation until all data specified in the request has been written to permanent storage. NFS Version 3 clients must protect buffered data that has been written using a safe asynchronous write but not yet committed. If a server reboots before a client has sent an appropriate COMMIT, the server can reply to the eventual COMMIT request in a way that forces the client to resend the original write operation. Version 3 clients use COMMIT operations when flushing safe asynchronous writes to the server during a close(2) or fsync(2) system call, or when encountering memory pressure.

For more information on the NFS Version 3 protocol, read RFC 1813.
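
The safe asynchronous write and COMMIT exchange described above can be sketched as a toy model. This is an in-memory illustration, not a real NFS client or server: the ToyServer and ToyClient classes are hypothetical, and the "verifier" is modeled as a simple boot counter, on the assumption that the server signals a reboot by returning a value that differs from the one the client saw when it sent the write (RFC 1813 calls this the write verifier).

```python
import itertools

UNSTABLE, FILE_SYNC = "UNSTABLE", "FILE_SYNC"


class ToyServer:
    """In-memory stand-in for an NFS Version 3 server (illustration only)."""

    _boot_ids = itertools.count(1)

    def __init__(self):
        self.disk = {}    # offset -> data that survives a reboot
        self.cache = {}   # offset -> data that is lost on reboot
        self.verifier = next(self._boot_ids)

    def write(self, offset, data, stable_how):
        if stable_how == FILE_SYNC:
            self.disk[offset] = data     # Version 2 style: on disk before replying
        else:
            self.cache[offset] = data    # safe asynchronous write: reply early
        return stable_how, self.verifier

    def commit(self):
        # Reply only after everything cached has reached permanent storage.
        self.disk.update(self.cache)
        self.cache.clear()
        return self.verifier

    def reboot(self):
        self.cache.clear()                     # uncommitted data is gone
        self.verifier = next(self._boot_ids)   # clients can detect the reboot


class ToyClient:
    def __init__(self, server):
        self.server = server
        self.uncommitted = {}   # offset -> (data, verifier seen at write time)

    def write_async(self, offset, data):
        _, verifier = self.server.write(offset, data, UNSTABLE)
        self.uncommitted[offset] = (data, verifier)   # keep the buffer

    def flush(self):
        """COMMIT, resending any write whose verifier no longer matches."""
        while self.uncommitted:
            verifier = self.server.commit()
            stale = {off: data
                     for off, (data, seen) in self.uncommitted.items()
                     if seen != verifier}
            self.uncommitted.clear()
            for off, data in stale.items():
                self.write_async(off, data)
```

A client that calls write_async(), sees the server reboot, and then calls flush() ends up resending the write, so the data still reaches the toy server's disk. This is why the client must keep its buffers until the COMMIT succeeds.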

A2. Can I run NFS across the TCP/IP Transport Protocol?

A. Client support for NFS over TCP is integrated into all 2.4 and later kernels. Server support for TCP appears in 2.4.19 and later 2.4 kernels, and in 2.6 and later kernels. Not all 2.4-based distributions support NFS over TCP in the Linux NFS server.

A3. Are there any other versions of NFS under development?

A. Yes. NFS Version 4 is being developed under the supervision of the Internet Engineering Task Force (IETF). The IETF hosts several documents that describe the NFS Version 4 working group's efforts to date. Several commercial vendors have already released NFS clients and servers that support the new version of NFS. A Linux implementation of NFS Version 4 is under development at the University of Michigan's Center for Information Technology Integration under the direction of Andy Adamson. This version is available now in the Linux 2.6 kernel. Although this is a reference implementation of an NFS Version 4 client and server, one of two such implementations required as part of the IETF's standards process, it is still missing some features. These features are currently under development and should appear soon. For more information, visit CITI U-M's NFSv4 project web site.

A4. How can I prevent the use of NFS Version 2, or of other NFS versions?

A. The protocol version is determined at mount time, and can be selected by specifying the version of the NFS protocol, or the version of the transport protocol, supported by the server. For example, the client mount command mount -o vers=3 foo:/ /bar will request that the server use NFS Version 3 when granting a mount request. (Note that "vers" and "nfsvers" have the same meaning in the mount command; the string "vers" is compatible with NFS implementations on Solaris and other vendors.) If you wish to prevent use of NFS Version 2 in all cases, then you must restart rpc.mountd on the server with the options "-N 1 -N 2". The best way to do this is to add these options to the rpc.mountd settings in the NFS startup script on the server, and then shut down and restart NFS as a whole:

cd /etc/rc.d/init.d
Modify RPCMOUNTDOPTS in the nfs script to include "-N 1 -N 2"
Restart nfs (you must have root access) with "./nfs restart"

You will now get the following error when attempting to NFS mount a file system using NFS Version 2 (now unrecognized) after restarting rpc.mountd:

mount: RPC: Unable to receive; errno = Connection refused

You will also subsequently get the following (non-fatal) warning when you unmount any NFS mounted file system at all, regardless of when it was mounted:

Bad UMNT RPC: RPC: Program/version mismatch; low version = 3, high version = 3
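
For scripts that pin the protocol version on the client side, the mount command from the answer above can be assembled programmatically. This is a minimal sketch: build_mount_command is a hypothetical helper, it only builds the argument list (it does not run mount), and it accepts only versions 2 and 3 because those are the ones the "vers" example above covers.

```python
import shlex


def build_mount_command(server, export, mount_point, nfs_version):
    """Return the argv for an NFS mount that requests one protocol version."""
    if nfs_version not in (2, 3):
        raise ValueError(f"unsupported NFS version for this sketch: {nfs_version}")
    return ["mount", "-o", f"vers={nfs_version}", f"{server}:{export}", mount_point]


# Same command as in the text: mount -o vers=3 foo:/ /bar
argv = build_mount_command("foo", "/", "/bar", 3)
print(shlex.join(argv))
```

Passing the list to subprocess.run() as root would perform the mount; building a list rather than a shell string avoids quoting problems with unusual export paths.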

A5. Can I use Kerberos authentication with NFS on Linux?

A. Sun defined a new interface called RPCSEC GSSAPI that creates the ability to use authentication plug-ins for protocols like NFS that ride on top of RPC. This is the standard way of providing Kerberos authentication support for NFS. Support for NFS security mechanisms using RPCSEC GSSAPI is now under development in Linux, based on work that is already in the 2.6 kernel. When completed, RPCSEC GSSAPI will work with all versions of the NFS protocol. In addition to the three flavors of Kerberos security (authentication, integrity checking, and full privacy), RPCSEC GSSAPI will eventually support other security flavors such as SPKM3, and will be fully compatible with other implementations such as the one in Solaris. Besides kernel support for RPCSEC GSSAPI, additional support is required in the form of various user-level changes (the mount command, and a pair of rpcgss daemons, for example). Currently, only Fedora Core 2 has RPCSEC GSSAPI enabled in its kernels and user-level support integrated into its standard distribution. We expect that, as this work matures, it will be adopted by all 2.6-based distributions. Currently Fedora Core 2 supports only the use of Kerberos 5 authentication with NFS Version 4. Because of bugs and missing features, for now support for Linux NFS with Kerberos is appropriate only for early adopters, and not for production use. For more information on RPCSEC GSS, read RFC 2203. Information on the Linux implementation of RPCSEC GSSAPI is available here.

A6. What are the main new features in version 4 of the NFS protocol?

A. Here is a short summary of new features. For a complete discussion of these features, see the documentation provided by the NFSv4 Working Group.

NFS Versions 2 and 3 are stateless protocols, but NFS Version 4 introduces state. An NFS Version 4 client uses state to notify an NFS Version 4 server of its intentions on a file: locking, reading, writing, and so on. An NFS Version 4 server can return information to a client about what other clients have intentions on a file to allow a client to cache file data more aggressively via delegation. To help keep state consistent, more sophisticated client and server reboot recovery mechanisms are built in to the NFS Version 4 protocol.
NFS Version 4 introduces support for byte-range locking and share reservation. Locking in NFS Version 4 is lease-based, so an NFS Version 4 client must maintain contact with an NFS Version 4 server to continue extending its open and lock leases.
NFS Version 4 introduces file delegation. An NFS Version 4 server can allow an NFS Version 4 client to access and modify a file in its own cache without sending any network requests to the server, until the server indicates via a callback that another client wishes to access the file. This considerably reduces the amount of traffic between NFS Version 4 client and server in cases where no other clients wish to access a set of files concurrently.
NFS Version 4 uses compound RPCs. An NFS Version 4 client can combine several traditional NFS operations (LOOKUP, OPEN, and READ, for example) into a single RPC request to carry out a complex operation in one network round trip.
NFS Version 4 specifies a number of sophisticated security mechanisms, and mandates their implementation by all conforming clients. These mechanisms include Kerberos 5 and SPKM3, in addition to traditional AUTH_SYS security. A new API is provided to allow easy addition of new security mechanisms in the future.
NFS Version 4 standardizes the use and interpretation of ACLs across POSIX and Windows environments. It also supports named attributes. User and group information is stored in the form of strings, not as numeric values. ACLs, user names, group names, and named attributes are stored with UTF-8 encoding.
NFS Version 4 combines the disparate NFS protocols (stat, NLM, mount, ACL, and NFS) into a single protocol specification to allow better compatibility with network firewalls.
NFS Version 4 introduces protocol support for file migration and replication.
NFS Version 4 requires support of RPC over streaming network transport protocols such as TCP. Although many NFS Version 4 clients continue to support RPC via datagrams, this support may be phased out over time in favor of more reliable stream transport protocols.

For more information on the NFS Version 4 protocol, read RFC 3530.

A7. I've heard NFS Version 4 is not interoperable with earlier versions of NFS. What's the real deal?

A. In the same way that an NFS Version 3-only client cannot communicate with an NFS Version 2-only server, an NFS Version 4-only client or server cannot communicate with clients and servers that only support earlier versions of NFS. NFS Version 4 uses a different version number in RPC headers to distinguish the new protocol version. Thus, clients that support only NFS Version 4 cannot communicate with servers that support only versions 2 and 3. True interoperability is achieved by implementing clients and servers that can communicate using all three protocol versions: NFS Versions 2, 3, and 4.

Early versions of the Linux NFS Version 4 prototype used two separate clients: the original client that supported NFS Versions 2 and 3, and a new separate client that supported only NFS Version 4. For various reasons this made it impossible to mount NFS Version 4 servers at the same time as NFS Version 2 and 3 servers were mounted. This was an implementation choice, not a protocol limitation. This is no longer the case: the Linux 2.5 NFS client, and all future versions of the Linux NFS client, support all three versions seamlessly, and can concurrently mount servers that export version 2, version 3, and version 4.

The goal is that NFS Version 4 will coexist with versions 2 and 3 in much the same way as NFS Version 3 coexists with NFS Version 2 today. Upgrading should be nearly transparent. There are some minor interoperability issues when applications running on clients make use of some of the new features of NFS Version 4 such as mandatory locking, share reservations, and delegations. These features help make NFS Version 4 more compatible with traditional Windows file systems like CIFS. Network Appliance, which makes file servers that can export file systems via both CIFS and NFS concurrently, has published papers describing some of these issues.

A8. What is close-to-open cache consistency?

A. Perfect cache coherency among disparate NFS clients is very expensive to achieve, so NFS settles for something weaker that satisfies the requirements of most everyday types of file sharing. Everyday file sharing is most often completely sequential: first client A opens a file, writes something to it, then closes it; then client B opens the same file, and reads the changes. So, when an application opens a file stored in NFS, the NFS client checks that the file still exists on the server, and that the opener is permitted to access it, by sending a GETATTR or ACCESS operation. When the application closes the file, the NFS client writes back any pending changes to the file so that the next opener can view the changes. This also gives the NFS client an opportunity to report any server write errors to the application via the return code from close(). This behavior is referred to as close-to-open cache consistency.

Linux implements close-to-open cache consistency by comparing the results of a GETATTR operation done just after the file is closed to the results of a GETATTR operation done when the file is next opened. If the results are the same, the client will assume its data cache is still valid; otherwise, the cache is purged. Close-to-open cache consistency was introduced to the Linux NFS client in 2.4.20. If for some reason you have applications that depend on the old behavior, you can disable close-to-open support by using the "nocto" mount option.

There are still opportunities for a client's data cache to contain stale data. The NFS version 3 protocol introduced "weak cache consistency" (also known as WCC) which provides a way of checking a file's attributes before and after an operation to allow a client to identify changes that could have been made by other clients. Unfortunately, when a client is using many concurrent operations that update the same file at the same time, it is impossible to tell whether it was that client's updates or some other client's updates that changed the file. For this reason, some versions of the Linux 2.6 NFS client abandon WCC checking entirely, and simply trust their own data cache. On these versions, the client can maintain a cache full of stale file data if a file is opened for write. In this case, using file locking is the best way to ensure that all clients see the latest version of a file's data.

A system administrator can try using the "noac" mount option to achieve attribute cache coherency among multiple clients. Almost every client operation checks file attribute information. Usually the client keeps this information cached for a period of time to reduce network and server load. When "noac" is in effect, a client's file attribute cache is disabled, so each operation that needs to check a file's attributes is forced to go back to the server. This permits a client to see changes to a file very quickly, at the cost of many extra network operations. Be careful not to confuse "noac" with "no data caching." The "noac" mount option will keep file attributes up-to-date with the server, but there are still races that may result in data incoherency between client and server.

If you need absolute cache coherency among clients, applications can use file locking, where a client purges file data when a file is locked, and flushes changes back to the server before unlocking a file; or applications can open their files with the O_DIRECT flag to disable data caching entirely. For a better understanding of the compromises faced in the design of NFS caching, see Callaghan's "NFS Illustrated."
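
The close-to-open check described above (compare attributes recorded at close with attributes fetched at the next open, and purge the cache on a mismatch) can be illustrated with a small model. This is only a sketch of the idea, not the kernel's implementation: CloseToOpenCache is a hypothetical class, os.stat() stands in for an NFS GETATTR, and the choice of modification time plus size as the compared attributes is an assumption.

```python
import os


class CloseToOpenCache:
    """Toy data cache validated by comparing attributes at close and open."""

    def __init__(self):
        self._data = {}    # path -> cached file contents
        self._attrs = {}   # path -> (mtime_ns, size) recorded at close

    @staticmethod
    def _getattr(path):
        st = os.stat(path)   # stands in for an NFS GETATTR
        return (st.st_mtime_ns, st.st_size)

    def open_and_read(self, path):
        current = self._getattr(path)
        if self._attrs.get(path) != current:
            self._data.pop(path, None)   # attributes changed: purge the cache
        if path not in self._data:
            with open(path, "rb") as f:
                self._data[path] = f.read()
        return self._data[path]

    def close(self, path):
        # Remember what the file looked like when we last closed it.
        self._attrs[path] = self._getattr(path)
```

If another writer changes the file between close() and the next open_and_read(), the recorded attributes no longer match and the cached contents are discarded. Changes made while the file is held open are not detected, which mirrors the stale-data cases discussed above.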

A9. Why does opening files with O_APPEND on multiple clients cause the files to become corrupted?

A. The NFS protocol does not support atomic append writes, so append writes are never atomic on NFS for any platform. Most NFS clients, including the Linux NFS client in kernels newer than 2.4.20, support "close to open" cache consistency, which provides good performance and meets the sharing needs of most applications. This style of cache consistency does not provide strict coherence of the file size attribute among multiple clients, which would be necessary to ensure that append writes are always placed at the end of a file. See question A8 for a description of the NFS cache consistency model. Alternatively, the NFS protocol could include a specific atomic append write operation, but today's versions of the protocol do not. The designers of the NFS protocol felt that atomic append writes would be rarely used, so they never added the feature. Even with such a feature, keeping the file size attribute up to date would be challenging.
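
Because O_APPEND cannot be relied on across clients, applications that share a log-style file usually serialize their appends themselves. The following is a minimal sketch using POSIX advisory locking; locked_append is a hypothetical helper, and it assumes that every writer cooperates by taking the same lock and that file locking works on the mount in question.

```python
import fcntl
import os


def locked_append(path, data):
    """Append data to path while holding an exclusive whole-file lock."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)    # blocks until the lock is granted
        os.lseek(fd, 0, os.SEEK_END)      # look up the end only after locking
        view = memoryview(data)
        while view:                       # handle short writes
            view = view[os.write(fd, view):]
        os.fsync(fd)                      # push the data out before unlocking
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)
```

The file offset is computed after the lock is taken rather than trusting O_APPEND, so two cooperating clients do not compute the same end-of-file position at the same time.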

A10. What does it mean when my application fails because of an ESTALE error?

A. The NFS protocol does not refer to files and directories by name or by path; it uses an opaque binary value called a file handle. In NFSv3 this file handle can be up to 64 bytes long; NFSv4 allows them to be even larger. A file's file handle is assigned by an NFS server, and is supposed to be unique on that server for the life of that file. Clients discover the value of a file's file handle by doing a LOOKUP operation, or by using part of the results of a READDIRPLUS operation. There is usually a special process done while mounting an NFS file system to determine the file handle of the file system's root directory. ESTALE is an error reported by a server when a file handle is not valid. Here are some common reasons why a file handle is not valid:

The file resides in an export that is not accessible. It could have been unexported, the export's access list may have changed, or the server could be up but simply not exporting its shares.
The file handle refers to a deleted file. After a file is deleted on the server, clients don't find out until they try to access the file with a file handle they had cached from a previous LOOKUP. Using rsync or mv to replace a file while it is in use on another client is a common scenario that results in an ESTALE error.
The file was renamed to another directory, and subtree checking is enabled on a share exported by a Linux NFS server. See question C7 for more details on subtree checking on Linux NFS servers.
The device ID of the partition that holds your exported files has changed. File handles often contain all or part of a physical device ID, and that ID can change after a reboot, RAID-related changes, or a hardware hot-swap event on your server. Using the "fsid" export option on Linux will force the fsid of an exported partition to remain the same. See the "exports" man page for more details.
The exported file system doesn't support permanent inode numbers. Exporting FAT file systems via NFS is problematic for this reason. This problem can be avoided by exporting only local filesystems which have good NFS support. See question C6 for more information.

A client can recover when it encounters an ESTALE error during a pathname resolution, but not during a READ or WRITE operation. An NFS client prevents data corruption by notifying applications immediately when a file has been replaced during a read or write request. After all, it is usually catastrophic if an application writes to or reads from the wrong file. Thus in general, to recover from an ESTALE error, an application must close the file or directory where the error occurred, and reopen it so the NFS client can resolve the pathname again and retrieve the new file handle. Older Linux NFS clients do not recover from an ESTALE error, even during pathname resolution. In the 2.6.12 kernel and later, the Linux VFS layer can redrive pathname resolution when an ESTALE is encountered to recover appropriately.
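
The close-and-reopen recovery described above can be sketched at the application level. This is an illustrative helper, not part of any NFS library: read_with_estale_retry and its opener parameter are hypothetical names, and restarting the read from the beginning is only appropriate when the application wants the whole (possibly replaced) file.

```python
import errno


def read_with_estale_retry(path, max_retries=1, opener=open):
    """Read a whole file, reopening it by name if the handle goes stale."""
    attempts = 0
    while True:
        try:
            with opener(path, "rb") as f:
                return f.read()
        except OSError as exc:
            if exc.errno != errno.ESTALE or attempts >= max_retries:
                raise
            # The with-block has already closed the stale file; looping
            # reopens it by pathname so a fresh file handle is looked up.
            attempts += 1
```

The retry count is kept small on purpose: a file that keeps returning ESTALE usually points at one of the server-side causes listed above rather than a one-off replacement.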

B. Performance

B1. What can I do to improve NFS performance in general?

A. Review the performance section of the NFS Howto doc and then look at several things:

How fast is the disk IO speed on your server(s)? That will have a big impact on overall NFS performance for both Version 2 and Version 3.
Does your application open its files with the O_SYNC option? That will force NFS Version 3 to behave exactly like (synchronous) NFS Version 2.
UDP requires IP fragment reassembly. If you see fragmentation errors indicated in the output of netstat -s you may want to increase the size of your socket buffers.
Have you started enough NFS daemons? Review the contents of /proc/net/rpc/nfsd, especially the line that begins with "th". The first number on that line is the total number of NFS server threads that are started and waiting for NFS requests. The second number indicates whether at any time all of the threads were running at once. The remaining numbers are a thread count time histogram. See the NFS How-to for details on tuning your server based on the data in this histogram.
Do your NICs and Switches/Hubs/Routers autonegotiate down to 10baseT or half duplex? Half duplex will give you many more network collisions, which are the worst thing possible for NFS performance in UDP.
Are you running ext3 or ReiserFS? You might look at placing the journal on a separate disk, or on NVRAM. As of January 2002, ext3 allows this, and Reiser has a patch available.
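
The "th" line mentioned above can be pulled apart with a few lines of code. This sketch follows the field layout described in the list (thread count, a count of how often all threads were busy, then the histogram); parse_th_line is a hypothetical helper and the sample text is made up for illustration, so check the output format on your own kernel before relying on it.

```python
def parse_th_line(text):
    """Extract the 'th' line fields from the contents of /proc/net/rpc/nfsd."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "th":
            return {
                "threads": int(fields[1]),
                "all_busy_count": int(fields[2]),
                "histogram": [float(value) for value in fields[3:]],
            }
    raise ValueError("no 'th' line found")


# Made-up sample in the layout described above.
sample = "rc 0 0 0\nth 8 3 10.500 2.250 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000\n"
stats = parse_th_line(sample)
print(stats["threads"], stats["all_busy_count"])
```

On a live server the input would come from reading /proc/net/rpc/nfsd; a non-zero second field suggests that starting more nfsd threads may be worth trying.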

B2. Everything seems so slow and I think the default rsize and wsize are set to 1024 - what's going on?

A. Normally, the Linux NFS client uses read-ahead and delayed writes to hide the latency of NFS read and write operations. However, the client can cache only a single read or write request per page. Thus, if reading or writing a whole page requires more than one on-the-wire read or write operation (which it certainly does if rsize or wsize is 1024), each of these operations must complete before the next one can be issued. In the case of small NFS Version 3 write operations, the write must be FILE_SYNC because the client must fully complete each write before it issues the next one. Note that this limitation becomes especially significant for hardware that supports larger pages. For instance, many distributors provide a Linux kernel built for Itanium processors that uses 16KB pages rather than 4KB pages normally found on 32-bit x86 systems. On such a system, if wsize is smaller than 16KB, the client always sends write operations serially, if they occur in the same page. Finally, note that the maximum transfer size permitted by the Linux server (NFSSVC_MAXBLKSIZE) is set to 32KB when applying all patches involved with the implementation of NFS over TCP in the 2.4 kernels. The latest 2.4 kernels have TCP support integrated, and allow transfer sizes up to 32KB.
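
The cost of a small rsize or wsize follows from simple arithmetic: the number of on-the-wire operations needed to cover one page is the page size divided by the transfer size, rounded up. The helper below is a hypothetical illustration of that calculation only.

```python
def wire_ops_per_page(page_size, transfer_size):
    """Number of on-the-wire reads or writes needed to cover one page."""
    if page_size <= 0 or transfer_size <= 0:
        raise ValueError("sizes must be positive")
    return -(-page_size // transfer_size)   # ceiling division


# 4KB page with rsize/wsize of 1024: four operations, issued one after another.
print(wire_ops_per_page(4096, 1024))
# 16KB page (as on the Itanium kernels mentioned above) with wsize of 8192.
print(wire_ops_per_page(16384, 8192))
```

Any result greater than one means the operations for that page are serialized, which is why setting rsize and wsize to at least the client's page size matters.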

B3. Why can't I mount more than 255 NFS file systems on my client? Why is it sometimes even less than 255?

A. On Linux, each mounted file system is assigned a major number, which indicates what file system type it is (e.g. ext3, nfs, isofs); and a minor number, which makes it unique among the file systems of the same type. In kernels prior to 2.6, Linux major and minor numbers have only 8 bits, so they may range numerically from zero to 255. Because a minor number has only 8 bits, a system can mount only 255 file systems of the same type. So a system can mount up to 255 NFS file systems, another 255 ext3 file systems, 255 more isofs file systems, and so on. 2.6 and later kernels have 20-bit wide minor numbers, which alleviate this restriction.

For the Linux NFS client, however, the problem is somewhat worse because it is an anonymous file system. Local disk-based file systems have a block device associated with them, but anonymous file systems do not. /proc, for example, is an anonymous file system, and so are other network file systems like AFS. All anonymous file systems share the same major number, so there can be a maximum of only 255 anonymous file systems mounted on a single host.

Usually you won't need more than ten or twenty total NFS mounts on any given client. In some large enterprises, though, your work and users might be spread across hundreds of NFS file servers. To work around the limitation on the number of NFS file systems you can mount on a single host, we recommend that you set up and run one of the automounter daemons for Linux. An automounter finds and mounts file systems as they are needed, and unmounts any that it finds are inactive. You can find more information on Linux automounters here.

You may also run into a limit on the number of privileged network ports on your system. The NFS client uses a unique socket with its own port number for each NFS mount point. Using an automounter helps address the limited number of available ports by automatically unmounting file systems that are not in use, thus freeing their network ports. NFS version 4 support in the Linux NFS client uses a single socket per client-server pair, which also helps increase the allowable number of NFS mount points on a client.

B4. Why does NFS Version 2 seem so much faster than Version 3?

A. There are actually two problems here, plus a feature. First, some background: the NFS Version 2 protocol specification requires a server to record each write to permanent storage before it sends a reply to a client. This makes server and client reboot recovery very simple, and provides a good guarantee that data sent to the server is permanently stored. Linux servers (although not the Solaris reference implementation) allow this requirement to be relaxed by setting a per-export option in /etc/exports. The name of this export option is "[a]sync" (note that there is also a client-side mount option by the same name, but it has a different function, and does not defeat NFS protocol compliance). When set to "sync," Linux server behavior strictly conforms to the NFS protocol. This is default behavior in most other server implementations. When set to "async," the Linux server replies to NFS clients before flushing data or metadata modifying operations to permanent storage, thus improving performance, but breaking all guarantees about server reboot recovery.

First problem:
The default value of this export option on Linux NFS servers before nfs-utils-1.0.1 was "async". If a system administrator did not specify either "sync" or "async" in /etc/exports, exportfs used "async" by default. This allowed the server to reply to Version 2 write operations and metadata update operations (such as CREATE or MKDIR) before the requested data was written to the server's disk, thereby greatly improving the performance of write operations as well as introducing the possibility of undetectable data corruption. Releases of nfs-utils starting with version 1.0.1 use a default value of "sync," which causes the Linux server to conform properly to the NFS protocol specification.
Second problem:
Support for NFS Version 3 in Linux 2.2's NFS server does not honor the "async" export option. Thus, by default on a system running Linux 2.2 with an old version of the nfs-utils package, NFS Version 2 writes are fast and unsafe, but Version 3 write and commit operations are safe, although slower, since they always follow the client's request for either UNSTABLE or FILE_SYNC (see question A1).
Feature:
When you use the exportfs command with its verbose option set, it displays the various export options in effect for each exported file system. If the "async" export option is set, it appears in the option list, but if "sync" is requested, it will not appear in the exportfs parameter list. This reflects the common usage of "sync" as the default in other platforms, but can be somewhat confusing.

B5. Why does default NFS Version 2 performance seem equivalent to NFS Version 3 performance in 2.4 kernels?

A. See B4 for background information on how export options affect the Linux NFS server's write behavior. Since Linux 2.4, the NFS Version 3 server recognizes the "async" export option. When this option is set, the server replies to clients before data has been written to permanent storage. The server also sends a FILE_SYNC response to the client, indicating that the client need not retain buffered data or send a subsequent COMMIT operation. This exposes the client to the same undetectable corruption as exists for NFS Version 2 (with "async") if the server crashes before it has actually written data to stable storage. (See question B6 for further discussion of this behavior and its consequences.) Note that even if a client sends a Version 3 COMMIT operation, the server replies immediately if the file system has been exported with the "async" option. Conversely, when the "sync" export option is used on a Linux 2.4 server, both Version 2 and Version 3 writes behave as required by the NFS protocol specification. In this case, NFS Version 3 has a performance advantage over NFS Version 2, while maintaining data resilience during a server crash. Note well that "[a]sync" also affects some metadata operations on the server.

B6. Why is the "async" export option unsafe, and is that really a serious problem?

A. The biggest problem is not just that it is unsafe, but that corruption may not be detected. In the Linux implementation of NFS Version 2, when the "async" export option is in effect, a Linux NFS server may crash before posting all NFS write requests to disk. A Version 2 client, however, always assumes data is permanently written to stable storage, and that it is safe to discard buffers containing the written data. After a server crash, the Version 2 client cannot know that unwritten data is lost; this is why Version 2 writes are supposed to be permanent before the server replies. Even if a client still has the modified data in its cache, the data on the server no longer matches what is cached on the client (since some or all of the writes did not complete before the server crashed). This may cause applications to make future decisions based on data cached by the client rather than what is on the server, thus further corrupting the file. For the Linux implementation of NFS Version 3, using the "async" export option to allow faster writes is no longer necessary. NFS Version 3 explicitly allows a server to reply before writing data to disk, under controlled circumstances. It allows clients and servers to communicate about the disposition of written data so that in the event of a server reboot, a Version 3 client can detect the reboot and resend the data. In summary, be sure all exports on your Linux NFS servers use the "sync" option by setting it explicitly or by upgrading your nfs-utils package to version 1.0.1 or later. If you need fast writes, be sure your clients mount using NFS Version 3. You may also improve write performance by adding the "wdelay" option to your exports.
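
To act on the advice to set "sync" explicitly, it helps to find the exports that do not name it. The following is a rough sketch, not a full /etc/exports parser: exports_missing_sync is a hypothetical helper, it assumes the common "path client(options) ..." layout, and it ignores quoted paths and backslash line continuations.

```python
import re

# One client specification: an optional host pattern, then optional "(options)".
_CLIENT = re.compile(r"^([^()\s]*)(?:\(([^)]*)\))?$")


def exports_missing_sync(text):
    """Return (path, client) pairs whose options do not explicitly say 'sync'."""
    missing = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        path, *clients = line.split()
        for spec in clients:
            match = _CLIENT.match(spec)
            if match is None:
                continue                      # layout this sketch does not handle
            host, opts = match.group(1), match.group(2) or ""
            options = [o.strip() for o in opts.split(",") if o.strip()]
            if "sync" not in options:
                missing.append((path, host))
    return missing


# Hypothetical exports file contents for illustration.
sample = """
/srv/data  client1(rw,sync) client2(rw,async)
/srv/pub   *(ro)   # no explicit choice
"""
print(exports_missing_sync(sample))
```

Entries reported here either use "async" or depend on the nfs-utils default, which, as explained above, differs between versions older and newer than 1.0.1.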

B7. I have achieved pretty fast speeds in some client benchmarks, but when my client is heavily loaded, it slows down considerably. Why does that happen?

A. The Linux client limits the total number of pending read or write operations per mount point. This prevents the client from exhausting its memory with cached read or write requests when the network or server is slow. The hard limit is 256 outstanding read or write operations per mount point. When that limit is reached, the client does not issue a new read or write operation until at least one outstanding read or write operation completes, thus serializing all reads and writes on that mount point until load is reduced. Two ways of mitigating this effect are to:

Increase rsize and wsize on your client's mount points. This increases the amount of data that can be involved in outstanding reads or writes at any given time.
Mount the same server partition multiple times on your clients, and spread your applications among the mount points.

This limit has been removed in 2.6 and later kernels.
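
On kernels that still have this limit, a rough calculation shows why a larger rsize or wsize, or an extra mount point, helps. The sketch below is only a back-of-the-envelope estimate based on the 256-operation limit quoted above; the function name is made up for illustration.

```python
# Back-of-the-envelope estimate, not a measurement: with a fixed cap on
# outstanding operations, data in flight grows with rsize/wsize and with
# the number of mount points used.

MAX_OUTSTANDING_OPS = 256  # per-mount hard limit quoted in the answer above


def max_bytes_in_flight(io_size, mounts=1, max_ops=MAX_OUTSTANDING_OPS):
    """Upper bound on bytes covered by outstanding reads or writes."""
    return io_size * max_ops * mounts


for io_size in (1024, 8192, 32768):
    mib = max_bytes_in_flight(io_size) / (1024 * 1024)
    print(f"rsize/wsize={io_size}: at most {mib:g} MiB in flight per mount")
```

Passing mounts=2 models the second mitigation, where the same server partition is mounted twice and the applications are spread across both mount points.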

B8. Why won't my client let me use rsize or wsize larger than 8KB when I mount my Linux NFS server?

A. NFS Version 2 supports up to 8KB reads and writes. NFS Version 3 allows larger reads and writes (see question A1). Stock 2.4 kernels earlier than 2.4.20 do not support read or write operations larger than 8192 bytes for either NFS Version 2 or 3. Server-side TCP support, introduced as an experimental compile-time option in 2.4.20, increases the server's maximum I/O size to 32KB by increasing the value of NFSSVC_MAXBLKSIZE (see question B2). When a client mounts a file server, the file server advertises the largest number of bytes it can read or write in a single operation. Clients always use the smaller of the server's maximum and the rsize and wsize values given in the client's mount command. Large values of rsize and wsize may inhibit performance when using UDP. UDP datagrams must be separated into fragments that fit within your network's Maximum Transmission Unit (MTU). The loss of any one of these fragments requires retransmission of the whole datagram. This may have a particularly adverse impact on client performance if your network is congested. TCP is considerably better at recovering one or two lost segments and managing network congestion, so larger I/O operations are usually more effective at reliably boosting performance when using NFS over TCP.
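
The size negotiation described above amounts to taking a minimum. A minimal sketch, with a hypothetical function name, of what a client ends up using:

```python
def effective_io_size(server_max, requested):
    """Size the client ends up using: the smaller of the two values."""
    return min(server_max, requested)


# A client asks for 32KB against a server that advertises only 8KB.
print(effective_io_size(server_max=8192, requested=32768))
# The same request against a server that advertises 32KB.
print(effective_io_size(server_max=32768, requested=32768))
```

This is why asking for a larger rsize or wsize has no effect when the server itself advertises 8192 bytes.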

B9. I use the "sync" or "noac" mount options. I've increased my wsize, but write throughput is lower than I expect. Why is this?

A. Normally, an NFS client delays sending application write requests, allowing application processing to overlap with NFS write operations. An NFS client only causes an application to wait for writes to complete when the application closes or flushes a file. When a client sends write operations synchronously, however, the client causes applications to wait for each write operation to complete at the server. This results in much lower performance. The Linux NFS client uses synchronous writes under many circumstances, some of which are obvious, and some of which you may not expect. Applications enable synchronous writes for a single file by opening a file with the O_SYNC or O_DSYNC flags. System administrators enable synchronous writes for all files in a local file system by mounting that file system with the "sync" option. The "noac" mount option also enables synchronous writes. If it didn't, applications running on other clients would have a difficult time retrieving file modifications if a client delayed writes. Currently the Linux NFS client has a limitation which prevents it from safely generating large synchronous writes. The client breaks large write requests into on-the-wire write operations that are no larger than a single page to guarantee that write requests arrive on the server's disk in byte order (some applications depend on this behavior). Even if you set wsize larger than a page, the client will break any application write request into page-sized NFS write operations to meet this guarantee. In addition, if the server's page size is larger than the client's page size, the server is forced to do additional work when the client writes in small chunks. NFS clients normally align reads and writes to their own page size, which then may be unaligned on the server if it uses larger pages. Depending on the server OS and filesystem, this could result in a number of performance limiting problems.
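
To get a feel for the cost of the page-size limitation, the following sketch counts on-the-wire write operations for one application write. It assumes a 4096-byte client page size and ignores any coalescing the client may do for delayed writes, so treat the numbers as rough estimates; the function name is illustrative only.

```python
import math


def wire_writes(app_write_bytes, wsize, page_size=4096, synchronous=False):
    """Estimate how many NFS WRITE operations one application write needs."""
    # Synchronous writes are capped at one page each, whatever wsize says.
    chunk = min(wsize, page_size) if synchronous else wsize
    return math.ceil(app_write_bytes / chunk)


one_mib = 1024 * 1024
print(wire_writes(one_mib, wsize=32768))
print(wire_writes(one_mib, wsize=32768, synchronous=True))
```

With synchronous writes, each of those operations also makes the application wait for a round trip to the server, which is where most of the lost throughput goes.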

B10. Sometimes my server gets slow or becomes unresponsive, then comes back to life. I'm using NFS over UDP, and I've noticed a lot of IP fragmentation on my network. Is there anything I can do?

A. UDP datagrams larger than the IP Maximum Transmission Unit (MTU) must be divided into pieces that are small enough to be transmitted. If, for example, your network's MTU is 1524 bytes, the Linux IP layer must break any UDP datagram larger than 1524 bytes into separate packets, each of which must fit within the MTU. These separated packets are called fragments.

The Linux IP layer transmits each fragment as it is breaking up a UDP datagram, encoding enough information in each fragment so that the receiving end can reassemble the individual fragments into the original UDP datagram. If something happens that prevents a client from continuing to fragment a packet (e.g., the output socket buffer space in the IP layer is exceeded), the IP layer stops sending fragments. In this case, the receiving end has a set of fragments that is incomplete, and after a certain time window, it will drop the fragments if it does not receive enough to assemble a complete datagram. When this occurs, the UDP datagram is lost. Clients detect this loss when they have not received a reply from the server after a certain time interval, and recover by retransmitting the datagram.

Under heavy write loads, the Linux NFS client can generate many large UDP datagrams. This can quickly exhaust output socket buffer space on the client. If this occurs many times in a short time, the client sends the server a large number of fragments, but almost never gets a whole datagram's worth of fragments to the server. This fills the server's IP reassembly queue, causing it to become unreachable via UDP until it expels the useless fragments from the queue.

Note that the same thing can occur on servers that are under a heavy read load. If the server's output socket buffers are too small, large reads will cause them to overflow during IP fragmentation. The client's IP reassembly queue then fills with worthless fragments, and little UDP traffic can get to the client. Here are some symptoms of this problem:

You use NFS over UDP with a large wsize (relative to the network's MTU), and your application workload is write-intensive, or with a large rsize with a read-intensive application.
You may see many fragmentation errors on your server or clients (netstat -s will tell the story).
Your server may periodically become very slow or unreachable.
Increasing the number of threads on your server has no effect on performance.
One or a small number of clients seem to make the server unusable.
The network path between your client and server may have a router or switch with small port buffers, or the path may contain links that run at different speeds (100Mb/s and GbE).

The fix is to make Linux's IP fragmentation logic continue fragmenting a datagram even when output socket buffer space is over its limit. This fix appears in kernels newer than 2.4.20. You can work around this problem in one of several ways:

Use NFS over TCP. TCP does not use fragmentation, so it does not suffer from this problem. Using TCP may not be possible with older Linux NFS clients and servers that only support NFS over UDP.
If you can't use NFS over TCP, upgrade your clients to 2.4.20 or later.
If you can't upgrade your clients, increase the default size of your client's socket buffers (see below). 2.4.20 and later kernels do this automatically for the NFS client's socket buffers. See Section 5.3 of the NFS How-To for more information.
If your rsize or wsize is very large, reduce it. This will reduce the load on your client's and server's output socket buffers.
Reduce network congestion by ensuring your GbE links use full flow control, that your switch and router ports use adequate buffer sizes, and that all links are negotiating their fastest settings.
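
To judge whether your rsize or wsize is "very large" relative to your MTU, it helps to count fragments. The sketch below assumes IPv4 with a 20-byte header and no IP options, an 8-byte UDP header, and independent fragment losses; it ignores the RPC and NFS headers that ride along with the data. The function names are made up for illustration.

```python
import math


def fragments_needed(udp_payload, mtu=1500, ip_header=20, udp_header=8):
    """Number of IP fragments needed to carry one UDP datagram."""
    # Each fragment carries (mtu - ip_header) bytes of the datagram; every
    # fragment but the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - ip_header) // 8 * 8
    return math.ceil((udp_payload + udp_header) / per_fragment)


def delivery_probability(fragments, fragment_loss):
    """Chance that every fragment arrives, assuming independent losses."""
    return (1.0 - fragment_loss) ** fragments


for size in (8192, 32768):
    n = fragments_needed(size)
    p = delivery_probability(n, fragment_loss=0.01)
    print(f"{size} bytes: {n} fragments, {p:.1%} chance of arriving whole")
```

Because one lost fragment costs the whole datagram, the chance of a datagram arriving intact falls quickly as the fragment count grows, which is the reasoning behind reducing rsize and wsize on lossy UDP paths.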

B11. Why does my server see so many ACCESS calls when using Linux clients?

A. Default NFS server behavior is to prevent root on client machines from having privileged access to exported files. Servers do this by mapping the "root" user to some unprivileged user (usually the user "nobody") on the server side. This is known as root squashing. Most servers, including the Linux NFS server, provide an export option to disable this behaviour and allow root on selected clients to enjoy full root privileges on exported file systems. Unfortunately, an NFS client has no way to determine that a server is squashing root. Thus the Linux client uses NFS Version 3 ACCESS operations when an application is running on a client as root. If an application runs as a normal user, a client uses its own authentication checking, and doesn't bother to contact the server. The Linux NFS client should cache the results of these ACCESS operations. In fact, in the new 2.6.x kernels, it does this, and it extends ACCESS checking to all users to allow for generic uid/gid mapping on the server. This also enables proper support for Access Control Lists in the server's local file system. In pre-2.6 kernels, the stock NFS client does not cache the results of ACCESS operations.

C. Common export configuration errors

C1. How are exported file systems and client mount points tracked on the server?

A. /etc/exports contains information about how file systems should normally be exported. This is only read by exportfs.

/var/lib/nfs/etab contains information about what filesystems should be exported to whom at the moment.
/var/lib/nfs/rmtab contains a list of which filesystems actually are mounted by certain clients at the moment.
/proc/fs/nfs/exports contains information about what filesystems are exported to each actual client (individual, not subnet or whatever) at the moment.
/var/lib/nfs/xtab is the same information as /proc/fs/nfs/exports but is maintained by nfs-utils instead of directly by the kernel. It is only used if /proc isn't mounted.

C2. Can I modify export permissions without needing to remount clients in order to have them take effect?

A. Yes. The safest thing to do is edit /etc/exports and run "exportfs -r".

Note that when a mount request arrives, mountd checks .../etab to see if that host is allowed access. If it is, an entry is placed in .../rmtab and the filesystem is exported, thus creating an entry in /proc/fs/nfs/exports.

When you run "exportfs -io <options> host:/dir", the entry in .../etab is changed, or a new one is added. If it is a subnet/wildcard/netgroup entry, then every line in .../rmtab is checked to see if it matches. When a match is found, a host-specific entry is given to (or changed in) the kernel.

When you run "exportfs -a", it makes sure that all entries in /etc/exports are properly reflected in .../etab. Any extra entries in etab are left alone. Once the correct content of etab has been determined, rmtab is examined to create a list of specific-host entries for any new entries in etab. These host-specific entries are given to the kernel.

When you run "exportfs -r", it ignores the prior contents of .../etab and initializes etab to the contents of /etc/exports. Then it inspects rmtab and makes any changes to /proc/fs/nfs/exports that are necessary.

C3. My exports seem to be readable by everyone - or /etc/exports is not giving the intended permissions

A. /etc/exports is VERY sensitive to whitespace, so the following statements are not the same, due to the space between "hostname" and the opening parenthesis:

/export/dir hostname(rw,no_root_squash)
/export/dir hostname (rw,no_root_squash)

The first will grant hostname read and write access to /export/dir without squashing root privileges. The second will grant hostname read and write privileges with root squash, and it will grant everyone else read and write access, without squashing root privileges.
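
A small checker can catch this mistake before it reaches a production server. The sketch below is a deliberately simplified parser, not the one nfs-utils uses: it ignores quoted paths and line continuations, and the "<everyone>" label and function names are invented for this example.

```python
import re

EVERYONE = "<everyone>"  # placeholder label used here for "any host"


def parse_export_line(line):
    """Split one exports line into (path, [(client, [options]), ...])."""
    line = line.split("#", 1)[0].strip()  # drop comments and padding
    if not line:
        return None
    path, *tokens = line.split()
    entries = []
    for token in tokens:
        # A token is "client", "client(options)" or a bare "(options)".
        match = re.fullmatch(r"([^()]*)(?:\((.*)\))?", token)
        if match is None:
            raise ValueError(f"cannot parse {token!r}")
        client = match.group(1) or EVERYONE
        options = match.group(2).split(",") if match.group(2) else []
        entries.append((client, options))
    return path, entries


def world_entries(line):
    """Return the option lists that apply to every host, if any."""
    parsed = parse_export_line(line)
    if parsed is None:
        return []
    return [opts for client, opts in parsed[1] if client == EVERYONE]


print(world_entries("/export/dir hostname(rw,no_root_squash)"))
print(world_entries("/export/dir hostname (rw,no_root_squash)"))
```

Any line for which world_entries() returns a non-empty list has an option group detached from its host name and deserves a second look.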

C4. I believe the Linux NFS server will not export a fat32 partition. Is that correct?

A. The FAT file systems can be exported, starting with the early 2.4 kernels, but if used extensively, it may cause grief. First, only those operations supported by the exported file system will be honoured. Operations such as "chown", "link", and "symlink" are not supported by these file systems, and will fail. Read/write/create etc., should be fine, as long as the files remain relatively unchanged. The most serious problem is that the FAT filesystem layout does not contain enough information to create a lasting identity needed for NFS to create persistent filehandles. For example, if you take a file, rename it to another directory, truncate it, and write new data to it, there is nothing stored in the filesystem that can be used to show that the resulting file is, in any sense, the "same" as the original file, and there is no way to find the new file given any details about the original file. Therefore, the Linux NFS server cannot guarantee that once you have opened a file, you can continue to have access to that file, if the file is modified in the ways given above. NFS may then be unable to locate or identify the file correctly, and so may return ESTALE errors.

C5. Sometimes my client gets a "permission denied" error when attempting to mount a file system, even though it managed it a few hours earlier with no change to the configuration on the server.

A. Your server's /etc/exports is probably misconfigured. If the exports file contains both domain names and IP addresses, it can result in random client behavior when mounting, especially if your clients have multiple IP addresses registered with DNS. If you export a directory and one of its ancestors, and both reside on the same physical file system on the server, it can result in random client behavior when mounting.

C6. Which local file systems can I export with the Linux NFS server?

A. We expect the following local file systems to work, as they are tested often: ext2, ext3, jfs, reiserfs, xfs. These local file systems may work or may have a few minor-ish issues: iso9660, ntfs, reiser4, udf. Ask on the NFS mailing list for details. Any file system based on FAT or not having the ability to provide permanent inode numbers will have trouble with NFS versions 2 and 3 (see question C4). Local file systems that are known not to work with the Linux NFS server are: procfs, sysfs, tmpfs (and friends).

C7. Why should I disable subtree checking on my NFS server exports?

A. When an NFS server exports a subdirectory of a local file system, but leaves the rest unexported, the NFS server must check whether each NFS request is against a file residing in the area that is exported. This check is called the subtree check. To perform this check, the server includes information about the parent directory of each file in NFS file handles that are handed out to NFS clients. If the file is renamed to a different directory, for example, this changes the file handle, even though the file itself is still the same file. This breaks NFS protocol-compliance, often causing misbehavior on clients such as ESTALE errors, inappropriate access to renamed or deleted files, broken hard links, and so on. In the opinion of many, subtree checking causes much more trouble than it saves, and should be avoided in most cases. The subtree_check option is necessary only when you want to prevent a file handle guessing attack from gaining access to files that fall outside the exported part of your server's local file systems. If you need to be certain that no one can access files outside the exported part of a local file system, set up the partitions on your server so that you only export whole file systems.

D. Commonly occurring error messages

D1. I keep getting permission failure messages at my NFS server. What are they?

A. The messages you are mentioning take the following format:

Jan 7 09:15:29 server kernel: fh_verify: mail/guest permission failure, acc=4, error=13
Jan 7 09:23:51 server kernel: fh_verify: ekonomi/test permission failure, acc=4, error=13

They happen when an NFS setattr operation is attempted on a file you don't have write access to. These messages are harmless.

D2. What is a "silly rename"? Why do these .nfsXXXXX files keep showing up?

A. Unix applications often open a scratch file and then unlink it. They do this so that the file is not visible in the file system name space to any other applications, and so that the system will automatically clean up (delete) the file when the application exits. This is known as "delete on last close", and is a tradition among Unix applications. Because of the design of the NFS protocol, there is no way for a file to be deleted from the name space but still remain in use by an application. Thus NFS clients have to emulate this using what already exists in the protocol. If an open file is unlinked, an NFS client renames it to a special name that looks like ".nfsXXXXX". This "hides" the file while it remains in use. This is known as a "silly rename." Note that NFS servers have nothing to do with this behavior. After all applications on a client have closed the silly-renamed file, the client automatically finishes the unlink by deleting the file on the server. Generally this is effective, but if the client crashes before the file is removed, it will leave the .nfsXXXXX file. If you are sure that the applications using these files are no longer running, it is safe to delete these files manually. The NFS version 4 protocol is stateful, and could actually support delete-on-last-close. Unfortunately there isn't an easy way to do this and remain backwards-compatible with version 2 and 3 accessors.
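
If a crashed client has left such files behind, a short script can list them for review. This is a minimal sketch that only reports files whose names start with ".nfs"; it deletes nothing, and the function name is made up for illustration.

```python
import os


def find_silly_renames(root):
    """Return paths under root whose file names start with '.nfs'."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith(".nfs"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Run it against the exported directory, for example find_silly_renames("/export/dir"), and remove the listed files only once you are sure no client still has them open.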

D3. What does this mean: svc: unknown program 100227 (me 100003)

A. It refers to a mount request by an NFS client which supports the Solaris NFS_ACL side-band protocol. The Linux NFS server in the mainline kernels does not support this protocol, but many distributions include patches that provide NFS_ACL support in their NFS implementation. The message can be ignored safely.

D4. I frequently see this in my logs:

kernel: nfs: server server.domain.name not responding, still trying
kernel: nfs: task 10754 can't get a request slot
kernel: nfs: server server.domain.name OK

A. The "can't get a request slot" message means that the client-side RPC code has detected a lot of timeouts (perhaps due to network congestion, perhaps due to an overloaded server), and is throttling back the number of concurrent outstanding requests in an attempt to lighten the load. Some possible causes:

Network congestion
Overloaded server
Packets (input or output) dropped by a bad NIC or driver....

D5. I just upgraded to the latest nfs-utils and now NLM locking no longer works on files residing on my NFS server. What's up?

A. The permissions on the /var/lib/nfs/sm and /var/lib/nfs/sm.bak directories must be addressed. Whichever user rpc.statd runs as must own those directories and have read and write access to them. The permissions should be set to 700 for both. In addition, etab, rmtab, and xtab must all exist and be writable by root.

D6. I've mounted with the "intr" option but processes still become unkillable when my server is unavailable. How do I kill the processes so I can unmount them?

A. It is true that even when using the "intr" mount option, you will not always succeed in killing a task that is hanging on NFS. In these instances, the task is usually waiting in the kernel on some semaphore that is held by another process. Since signals cannot interrupt semaphores, the signal will have no effect on the hanging task. There have been some suggested solutions, but none have been implemented. One is to set up a special class of semaphores which are killable with 'SIGKILL', but replacing the relevant semaphores in the VFS and VM layers will not be possible before the 2.7 kernels at the earliest. Another solution under consideration is to cause rpciod to awaken all waiting requests when a user requests an unmount, allowing them to exit with an error. Until these are implemented, you can work around this problem by killing all processes waiting for I/O to complete in a given file system:

use 'lsof' or some other means to identify processes waiting on files in the target file system,
kill -9 all the processes, then
kill -9 rpciod.

Another, less desirable, workaround is to use "soft" mounts. This will cause processes to stop retrying I/O after a time. Eventually processes become unstuck and your file system can be unmounted. However, soft mounts are not completely safe. See question E4 for a description of the risks of using "soft" mounts.

D7. How come lock recovery doesn't work for me?

A. When a client reboots, it should notify any servers it had previously mounted to release all locks that were held. It does this by invoking rpc.statd during system start up. There are several common problems that can prevent rpc.statd from working. First, be sure that your client has the appropriate startup script enabled (/etc/rc.d/init.d/nfslock on Red Hat distributions). Next, make certain that when rpc.statd starts up, the network is already available for it to work (some DHCP-configured hosts may have a problem with this, for example). Make sure that the client's nodename (uname -n) is the same as what is returned by gethostbyname(3) on your client. These can differ because of your nsswitch configuration, the contents of /etc/hosts, because your client is configured via DHCP, or because of DNS misconfiguration. The in-kernel lockd process uses a client's nodename to identify its locks when sending lock requests. Rpc.statd must send an identical string when it sends a recovery notification, otherwise the server has no way to match the notification to any locks it may still hold for the client. It is also recommended that the nodenames for your NFS clients be fully qualified domain names, not just a hostname. If another client in a different domain with the same hostname contacts your server, a fully qualified nodename on both clients will allow the server to distinguish between locks set on each client. When traversing a firewall between your clients and server, bi-directional RPC traffic must be allowed if you need lock recovery to work, as NLM is callback-based. Two important issues that may prevent the server from calling the client are:

Blocking locks (F_SETLKW) will be hampered since the client expects the server's lockd daemon to call it back as soon as any conflicting locks have been released and the lock granted.
Server reboot recovery will be broken, since the server's rpc.statd daemon will be incapable of notifying the clients that their locks have been lost and need to be recovered.

D8. When my application uses memory-mapped NFS files, it breaks. Why?

A. Usually this is because application developers rely on certain local file system behaviors to guarantee data consistency, rather than reading the mmap man pages carefully to understand what behavior is required by all file system implementations. Some examples: Although some implementations of munmap(2) happen to write dirty pages to local file systems, the NFS version of munmap(2) does not. An msync(2) call is always required to guarantee that dirty mapped data is written to permanent storage. A subtle ramification of the Linux NFS client's treatment of munmap(2) is that it does not consider munmap(2) to be a close operation for the purposes of close-to-open cache coherency. The distinction between the MS_SYNC and MS_ASYNC flags is also important. MS_ASYNC will force dirty mapped pages to permanent storage eventually. Only MS_SYNC guarantees that the pages are written before msync(2) returns to your application. Therefore applications should use msync(MS_SYNC) to serialize data writes to mapped files. Finally, the Linux NFS client may not flush dirty mapped pages when a file descriptor is closed via close(2). Oftentimes during close processing, the client may flush mapped pages along with pages dirtied by a write(2) call, but this behavior is not guaranteed. Many applications will open a file, map it, then close it and continue using the map. The behavior described above is an attempt to optimize the performance of this use case.
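
As an illustration of the "always msync" rule, here is a minimal Python sketch that modifies a mapped file and flushes it explicitly before unmapping. It assumes that mmap.flush() in CPython on Unix is implemented with msync(2) using MS_SYNC, which matches my reading of the interpreter but is worth verifying for your version; the helper name is hypothetical. A C program would call msync(addr, length, MS_SYNC) directly.

```python
import mmap


def update_mapped_file(path, offset, data):
    """Overwrite bytes in an existing, non-empty file through a shared mapping."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the default access is a shared,
        # writable mapping.
        with mmap.mmap(f.fileno(), 0) as m:
            m[offset:offset + len(data)] = data
            # Do not rely on munmap or close to write dirty pages back;
            # flush explicitly while the mapping still exists.
            m.flush()
```

The important design point is the order: the flush happens before the mapping and the file descriptor are released, so the application never depends on munmap(2) or close(2) to push dirty pages to the server.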

D9. When I update shared executable files on my NFS exports, programs running on my clients all segfault. How come?

A. If you simply copy the new executable or library over an old version, you are violating the NFS cache consistency rules (described here) by changing a file that is being held open on your clients. Copying over executables creates a window during which an NFS client's cache may hold parts of the old version and parts of the new version, all combined in the same file. The correct way to update executables and shared libraries on your NFS shares is to use the install program with the '-b' option. That renames the version of the executable that is in use, then creates a brand new file to contain the new version of the executable.
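
For deployment scripts that cannot call install directly, the same idea can be sketched in Python: keep the old file's inode alive under a backup name and put the new version into a brand new file. This is an approximation of what the answer describes, not a reimplementation of install; the function name and the "~" backup suffix are assumptions made for this example.

```python
import os
import shutil
import tempfile


def install_with_backup(src, dst, backup_suffix="~"):
    """Install src as dst without overwriting dst's existing inode."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # Build the new version in a fresh file in the same directory, so the
    # final rename stays within one file system.
    fd, tmp = tempfile.mkstemp(dir=dst_dir, prefix=".install-")
    os.close(fd)
    try:
        shutil.copy(src, tmp)  # copies data and permission bits
        if os.path.exists(dst):
            # Clients holding the old file open keep using this inode.
            os.rename(dst, dst + backup_suffix)
        os.rename(tmp, dst)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Note that there is a brief window between the two renames in which dst does not exist, and that the backup should be removed only after every client has stopped running the old version.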

D10. I'm trying to use flock()/BSD locks to lock files used on multiple clients, but the files become corrupted. How come?

A. flock()/BSD locks act only locally on Linux NFS clients prior to 2.6.12. Use fcntl()/POSIX locks to ensure that file locks are visible to other clients. Here are some ways to serialize access to an NFS file.

Use the fcntl()/POSIX locking API. This type of locking provides byte-range locking across multiple clients via the NLM protocol, or via NFSv4.
Use a separate lockfile, and create hard links to it. See the description in the O_EXCL section of the creat(2) man page.

It's worth noting that until early 2.6 kernels, O_EXCL creates were not atomic on Linux NFS clients. Don't use O_EXCL creates and expect atomic behavior among multiple NFS clients unless you are running a kernel newer than 2.6.5.

It's a known issue that Perl uses flock()/BSD locking by default. This can break programs ported from other operating systems, such as Solaris, that expect flock()/BSD locks to work like POSIX locks.

On Linux, using file locking instead of a hard link has the added benefit of checkpointing the client's cache with the server. When a file lock is acquired, the client will flush the page cache for that file so that any subsequent reads get new data from the server. When a file lock is released, any changes to the file on that client are flushed back to the server before the lock is released so that other clients waiting to lock that file can see the changes.

The NFS client in 2.6.12 provides support for flock()/BSD locks on NFS files by emulating the BSD-style locks in terms of POSIX byte range locks. Other NFS clients that use the same emulation mechanism, or that use fcntl()/POSIX locks, will then see the same locks that the Linux NFS client sees. On local Linux filesystems, POSIX locks and BSD locks are invisible to one another. Thus, due to this emulation, applications running on a Linux NFS server will still see files locked by NFS clients as being locked with a fcntl()/POSIX lock, whether the application on the client is using a BSD-style or a POSIX-style lock. If the server application uses flock()/BSD locks, it will not see the locks the NFS clients use.
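
For scripting languages the choice of locking call matters just as much. In Python, fcntl.lockf() is a wrapper around the fcntl() locking calls, while fcntl.flock() uses flock(2). A minimal sketch of serializing appends with a POSIX lock follows; the helper name is made up for this example.

```python
import fcntl
import os


def append_with_posix_lock(path, data):
    """Append data to path while holding an exclusive POSIX lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # lockf() takes an fcntl()/POSIX lock on the whole file and blocks
        # until the lock is granted.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        try:
            # Find the end of file only after the lock is held.
            os.lseek(fd, 0, os.SEEK_END)
            os.write(fd, data)
            os.fsync(fd)
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Seeking to the end of the file only after the lock is granted takes advantage of the cache checkpointing described above, since the client revalidates its view of the file when the lock is acquired.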

D11. Why doesn't "mount -oremount,tcp" convert an NFS-mounted file system mounted with UDP to one mounted with TCP?

A. The "remount" option on the mount command only affects the generic mount options, such as ro/rw, sync, and so on (see man mount for a complete list of generic mount command options). The NFS-specific mount options listed on the nfs man page can't be changed with a "mount -oremount" style mount command. You must unmount your file system and mount it again with new options in order to modify the NFS-specific settings. Note that the mount command may update the contents of /etc/mtab whether or not the actual mount settings have changed in the kernel. So when you try mount -oremount with an NFS-specific mount option, subsequent mount commands may report that the setting is in effect. This is only because the mount command is reading /etc/mtab. The /proc/mounts file reflects the true mount options that the kernel is using.
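
Since /proc/mounts is the authoritative source, a script that needs to know the transport in use should read it rather than /etc/mtab. The sketch below parses the text of /proc/mounts; it assumes the kernel reports the transport either as a bare "tcp"/"udp" option or as "proto=tcp"/"proto=udp", and the function names are invented for illustration.

```python
def nfs_mount_options(mounts_text):
    """Map each NFS mount point to its option list, from /proc/mounts text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in ("nfs", "nfs4"):
            result[mount_point] = options.split(",")
    return result


def uses_tcp(options):
    """True if the option list names TCP as the transport."""
    return "tcp" in options or "proto=tcp" in options
```

Typical use is to pass the contents of the file, for example nfs_mount_options(open("/proc/mounts").read()), and then check uses_tcp() for the mount point in question after remounting.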

D12. I didn't mount with "intr" (the default is "nointr") and some processes are unkillable when my server becomes unavailable. What can I do?

A. Use the umount command's "-f" flag to force an unmount. There will be a brief pause while the umount command attempts to contact the server, and then all outstanding requests to the server will be failed, thus making the processes killable. Some programs upon receiving an I/O error will just try more I/O, making them unkillable again. For this reason, try killing all processes on the stuck mount first, and then run "umount -f". When the I/O requests fail, the process will become killable, will see the signal, and will die. Sometimes it can take a couple of iterations of the "kill processes" then "umount -f" cycle until the filesystem is unmounted, but it usually works. If all else fails, you can still unmount the partition on which the processes are hanging using the "umount -l" command. This causes the stuck mount to become detached from the file system name space hierarchy on your client, so it is no longer visible to other processes. You can replace that mount point with another mount to the same server when it becomes available again, or to some other server if the remote data has moved. Note, though, that the old mount point will continue to consume client memory until the stuck processes have all died.

E. Using Linux NFS with alternate platforms

E1. I use a Tru64 Unix 4.x or SunOS 4.1.x client. NFS file locking does not seem to work unless I give all users permissions on the file.

A. The default specifications for NFS Versions 2 and 3 allow any user to lock a file regardless of whether that user has permission to access the file. The writers of the Linux NFS server regarded this behavior as insecure, and chose to only allow users who have access to a file to be able to lock it. However, older SunOS and Tru64 clients, and some HP/UX clients, take advantage of the NFS specification by making all NFS file lock requests with the credentials of the daemon. This means that if the daemon does not have access to the files, the server will refuse to lock them. The export option no_auth_nlm is designed to alleviate this problem. Set it on any shares you wish to export to these clients. This will disable the authorization check on file lock requests.

E2. I'm not using Red Hat or VA Linux distros so the nfs-utils startup script in the rpm is broken. What do I do?

A. You should comment out the line in /etc/rc.d/init.d/nfs that says this:

. /etc/rc.d/init.d/functions

E3. I'm using an IRIX client and I'm seeing an array of problems with file lists and cwd from a Linux server. The server is running NFS Version 3. Is this a Linux bug?

A. IRIX improperly deals with file handles of less than 32 bytes, which the NFS server in Linux 2.4.x uses. SGI has addressed this problem in IRIX 6.5.13, which was released in 2001. A workaround to this problem is to use NFS Version 2. On the IRIX client, use vers=2 in your mount options.

E4. Why do I get NFS timeouts when I mount a Linux NFS server from my Solaris NFS client?

A. You get NFS timeouts because you are using soft mounts. Normally, mounts are hard, which requires the client to continue attempts to reach the server forever. A soft mount allows the client to stop trying an operation after a period of time. A soft timeout may cause silent data corruption if it occurs during data or metadata transmissions, so you should only use soft mounts in the cases where client responsiveness is more important than data integrity. If you require the use of soft mounts over an unreliable link such as DSL, try using TCP, which is what Solaris uses by default. This will help manage the impact of brief network interruptions. If using TCP is not possible, then you should reduce the risk of using soft mounts with UDP by specifying long retransmission timeout values and a relatively large number of retries in the mount command options (for example, timeo=30, retrans=10). Note that NFS over UDP now uses a retransmit timeout estimation algorithm in the latest 2.4 and 2.6 kernels, which means the timeo= mount option is less effective at preventing data corruption due to a soft timeout.


Monday, December 13, 2010
