7.5 NFS

The Network Filesystem (NFS), first developed by Sun Microsystems, has become a ubiquitous part of the computing landscape, in large part because of its simplicity and availability on most platforms. There are two flavors of NFS: Version 2, which has been in use since 1985, and Version 3, which was introduced in 1993. Most modern NFS implementations support both versions, with the newer version preferred if both the server and client are capable of using it. The most important concept in NFS is that the protocol is stateless; the functionality is defined such that the client and server need not maintain a continuous connection. The client simply submits a request, and the server processes it and responds. The server may have crashed or been shut down between the request and the response, and neither side need worry.

NFS works within the context of several other services, namely a file lock manager, a mechanism to start service (the mount protocol), and an automated name-lookup mounter. File locking poses an interesting problem for a stateless file sharing protocol, because locking involves state implicitly. To work around this, locking is implemented by the Network Lock Manager protocol (NLM), which relies on a related protocol implemented in rpc.statd. A list of all locked files is maintained on the server, which is also responsible for pushing the lock information out to its clients. Because NFS is stateless, the server and client need a mechanism to determine the other's state in order to know when to reacquire a lock (e.g., when the server is rebooted) and when to invalidate a lock (e.g., when the client unmounts the filesystem); this is the role played by statd.

The mount protocol itself is not very interesting, except to note that all negotiation is done at mount time; that is, the block size, the local attribute management, the version of the protocol to use, and the transport protocol are all determined when the remote filesystem is mounted.

The automounter is a client-side application overlaid on the mount protocol. Remote filesystems are assigned locations within the filesystem, and all references to those locations are trapped by the automounter. If the remote filesystem is not mounted, the automounter finds the remote filesystem and mounts it in the appropriate location. After a period of inactivity, the automounter unmounts the filesystem. When combined with universal naming services like NIS and NIS+, the automounter can help define a consistent filesystem across an enterprise. The automounter imposes no significant load on any component.

The Version 2 protocol is extremely simple, implementing only 18 operations. In 1993, due to customer pressures for technical features, the NFS protocol was revised to Version 3, making improvements in some key areas:

  • Write operations are much faster, using a two-stage commit protocol.

  • The number of packets actually crossing the network is reduced by permitting file attributes to be returned on every operation. Every operation that modifies a file's state returns the modified attributes.

  • The maximum file size was increased from 2^32 bytes to 2^64 bytes, and the maximum size of a data block was increased from 8 KB to 4 GB. [24]

    [24] The maximum size of the data block is usually determined by the underlying attributes of the network. For example, Solaris supports block sizes up to 64 KB, which is the largest block permitted by either the TCP or UDP implementation.

  • Much more sophisticated file access controls are supported (e.g., access control lists).

Although they are quite similar, the two versions of NFS are not directly interoperable. The client and server negotiate which protocol to use during the filesystem mounting process. Note that not all clients capable of Version 3 operation will default to it, and some adjustments may be required on your platform.

The server reports the size of an exported filesystem only to allow the client to determine the amount of free space; this number is not relevant to manipulating files. Although the amount of free space may be reported strangely on the client (even as a negative number), file access will still work as expected. However, large files (bigger than 2 GB) are problematic on Version 2 servers because of an inherent addressability restriction in NFS Version 2.

You can select the version of the NFS protocol to use by specifying the vers=n option to the mount command, where n is either 2 or 3. In addition, either Version 2 or Version 3 may be run over TCP as well as UDP; this is specified by proto=tcp or proto=udp, respectively, as a switch to the mount command. A Version 2 mount will default to UDP, and a Version 3 mount will default to TCP.
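For example, to force a Version 3 mount over UDP explicitly on a Solaris client, you might use something like the following sketch (the server name and paths are placeholders):

 # mount -F nfs -o vers=3,proto=udp server:/export/home /home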

7.5.1 Characterizing NFS Activity

In NFS Version 2, the bulk of operations consists of six functions: lookup, getattr, setattr, readlink, read, and write. These functions can be grouped into two categories: the first four manipulate the file's attributes (filename lookup, getting and setting attributes, and resolving symbolic links), and the last two involve reading and writing the contents of a file. These categories place very different loads on the server.

Attribute operations are lightweight. Because of their small size, most of the filesystem attributes will be cached in memory, and even if they are not, they are easily retrieved from disk. While the overhead of processing these requests is rather high (since the proportion of useful data bytes in the transmitted data is low), the overall small size of the packets leads to low bandwidth consumption.

However, data operations are a different story. In NFS Version 2, they are almost always the maximum size of 8 KB, and in Version 3 they can be much larger. In addition, while each file has only one set of attributes, it can have a great number of data blocks, most of which are not usually cached on the server. Because of the larger size of the operation, they also consume much more bandwidth.

NFS servers typically spend the vast majority of their time servicing attribute rather than data operations. When a client system wants to use a file on a remote server, it uses a series of lookup operations to find the file and a getattr operation to obtain the file's permissions mask, and finally issues a read to obtain the first block of data. For example, consider a read to a user's .forward file. It is common practice on systems with many users to sort home directories alphabetically, so for illustration's sake let's assume the file is located in /shared/home/u/us/user/.forward . To read the file, the client must be able to access each of the four directories exported to it. Verifying this means looking up each entry in the containing directory and obtaining the permissions mask to determine if the permissions are sufficient -- a total of four lookup operations. [25] The file itself must then be looked up, the permissions checked, and a single block read. We have caused six operations, only one of which was a data operation!

[25] The lookup operation causes attributes to be returned automatically for directories.
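If you want to watch this sequence of operations on the wire, Solaris snoop can filter on NFS RPC traffic; a rough sketch (the interface name and client hostname are placeholders, and the filter syntax should be verified against snoop(1M) on your release):

 # snoop -d hme0 host client1 and rpc nfs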

A scenario involving the access of many small files is attribute-intensive for reasons described previously. The classic example is a software development environment in which there are many small files, whether source code proper, header files, object files, or revision control files. Most home directory servers fit this pattern as well. Because NFS transactions block on the client, the performance of the system is dominated by the speed at which the server can handle the attribute requests. In a switched environment, 10 Mb/s Ethernet is usually acceptable for such a workload.

In a data-intensive environment where the average files are very large (e.g., image processing), the expense of transmitting data overcomes the attribute-processing time. When multiple clients are active on a repeated (shared) network, users begin to perceive sluggish performance at about 40% utilization; switched networks are immune to this phenomenon, although the uplink from the switch to the server can easily become very congested. Thankfully, NFS traffic is not generally constant; it is very bursty. Although clients can make huge demands of servers and networks, they do so only sporadically, and there is little demand the rest of the time.

7.5.2 Tuning Clients

It is critical to realize that, on Unix systems, a remotely mounted filesystem is managed identically to a local disk subsystem. This means that the virtual memory subsystem is interposed between applications and the NFS client software; files stored on a remote mount are cached in memory just as local files are. This caching mechanism delays, and sometimes prevents, NFS activity.

Another approach to minimizing network activity is CacheFS, described in greater detail in Section 5.4.10. One of the best features of CacheFS is that it is completely client-side; the server is not aware of the client's internal caching. However, it is not a magic bullet. Because it keeps copies of data blocks, the CacheFS subsystem must periodically check the cached files for consistency; any cached file that has been modified on the server is purged and reloaded on the next access. Because most programs work on entire files (via read(2) and write(2)) rather than on specific data blocks (e.g., via mmap(2)), CacheFS usually ends up caching entire files. Filesystems that are frequently modified are, therefore, poor candidates for CacheFS; files will be continuously cached and then purged, which results in more network traffic than would be seen with an unaugmented NFS configuration. One excellent example of this is the /var/mail directory: a mail spool file of any substantial size will be flushed and reread every time a new mail message arrives, which will be very slow if new email is received frequently.

You can evaluate the performance of a CacheFS filesystem by using the cachefsstat(1m), cachefswssize(1m), and cachefslog(1m) commands. As a general rule, the hit rate for a given filesystem should be higher than 35%; lower hit rates mean either that the cache is too small (verifiable by comparing the size of the filesystem working set, as reported by cachefswssize, to the cache size) or that the access pattern on the cached filesystem is very random. A rate of consistency failures in excess of 20% or so is a strong indication that the filesystem is being updated too quickly for CacheFS to be of real benefit.
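As a sketch of setting up and then evaluating such a cache (the cache directory, server, export, and mount point here are all placeholders):

 # cfsadmin -c /var/cache/nfs
 # mount -F cachefs -o backfstype=nfs,cachedir=/var/cache/nfs \
       server:/export/tools /tools
 # cachefslog -f /var/tmp/tools.cachelog /tools
 # cachefsstat /tools
 # cachefswssize /var/tmp/tools.cachelog

cachefsstat reports the hit rate and consistency-check counts; once the filesystem has seen some use, cachefswssize reports the working-set size, which you can compare against the cache size as described above.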

In data-intensive NFS Version 3 environments, Solaris clients may wish to set the nfs:nfs3_nra kernel parameter to 6, which configures the client to read ahead six blocks of data. This value has shown good results in tests.
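That setting lives in /etc/system on the client and takes effect at the next boot; a minimal sketch:

 * /etc/system (client): read ahead six blocks on NFS Version 3 mounts
 set nfs:nfs3_nra = 6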

7.5.2.1 Obtaining statistics for an NFS-mounted filesystem

One way to determine the operational statistics for an NFS mount is to use iostat , as described in Section 5.5.4.

Another powerful tool is nfsstat -c, which displays some client-side NFS statistics. This example is taken from a lightly loaded workstation that mounts /home via NFS:

 #  nfsstat -c
 ...
 Client rpc:
 Connection oriented:
 calls       badcalls    badxids     timeouts    newcreds    badverfs
 164250      0           0           0           0           0
 timers      cantconn    nomem       interrupts
 0           0           0           0
 ...
 Client nfs:
 calls       badcalls    clgets      cltoomany
 159786      0           159786      0
 Version 3: (158872 calls)
 null        getattr     setattr     lookup      access      readlink
 0 0%        58283 36%   1863 1%     10003 6%    19134 12%   47 0%
 read        write       create      mkdir       symlink     mknod
 39926 25%   23395 14%   2209 1%     2 0%        12 0%       0 0%
 remove      rmdir       rename      link        readdir     readdirplus
 2276 1%     2 0%        487 0%      93 0%       480 0%      461 0%
 fsstat      fsinfo      pathconf    commit
 66 0%       1 0%        27 0%       105 0%
 ...
 Version 3: (914 calls)
 null        getacl      setacl
 0 0%        914 100%    0 0%
 ...

Note that I have trimmed irrelevant output. There are a few red flags to watch for:

  • If the badxids field is approximately equal to the timeouts field, and both are greater than 5% of the number of calls, look for bottlenecks on the NFS server.

  • If the badxids field is relatively small but the timeouts field is greater than 5% of the number of calls, look for network problems resulting in dropped packets.

The badxids field counts how many times a reply was received from a server that did not correspond to any outstanding request; the timeouts field corresponds to the number of times an NFS call timed out while waiting for a reply from the server.

If you are using a service mounted via UDP, nfsstat -m will return some statistics concerning the average time to execute certain commands:

 %  nfsstat -m
 /home from cassandra:/workspaces
  Flags: vers=3,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=32768,
         wsize=32768,retrans=5
 /mnt from london:/services/patches
  Flags: vers=3,proto=udp,sec=sys,hard,intr,link,symlink,acl,rsize=32768,
         wsize=32768,retrans=5
  Lookups: srtt=7 (17ms),  dev=3 (15ms),  cur=2 (40ms)
  Reads:   srtt=18 (45ms), dev=11 (55ms), cur=7 (140ms)

As you can see, the first mount (/home) is mounted with NFS Version 3 over TCP, whereas the second (/mnt) is mounted with NFS Version 3 over UDP. The srtt (smoothed round-trip time) field is key: if it rises above 60 ms or so, the device will begin to feel "slow." The dev field represents the variability in the srtt field, and cur describes the current timeout level for retransmission.

Another way to watch the behavior of an NFS-mounted filesystem in Solaris is to use iostat -xn , as described in Section 5.5.4:

 #  iostat -xn 30
 ...
                     extended device statistics
   r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   8.2 20.4   68.2 1042.1  0.0  1.3    0.0   46.7   0  47 c0t0d0
   0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 cassandra:/workspaces
   2.8  0.0   78.1    0.0  0.2  0.3   56.9   89.9   1   8 london:/services/patches
 ...

This illustrates that there's more than one way to do something in performance tuning.

7.5.2.2 The rnode cache

The rnode cache corresponds to the UFS inode cache, but caches information about files accessed over NFS. This data is gathered as a result of getattr calls to the NFS server (which keeps this information in a vnode of its own). The default rnode cache size is twice the size of the DNLC, and shouldn't need to be tuned.

7.5.2.3 Tuning NFS clients for bursty transfers

This section is most relevant to applications that cycle between two stages that each make significant contributions to the application runtime. One of these stages should be I/O intensive, and the other should not be. Many high performance computing applications exhibit this sort of pattern.

NFS operations on a single file, even from a single process, are split up by the kernel onto a number of kernel threads equal to the nfs3_max_threads parameter, which is 8 by default. The client caches up to 64 KB of written data per thread before causing the process to block; therefore, increasing the number of NFS3 kernel threads may decrease or eliminate the time the application spends waiting for networked I/O rather than computing. nfs3_max_threads should be kept to a reasonable value (in the range of 6-8 per processor). What is actually accomplished here is parallelism between the data transfer over the wire, which is done by the kernel on the buffers held by the NFS kernel threads, and the application processing. This will not improve maximum sustained NFS performance.

In order for this to work, the application must keep the file descriptor open; if the file descriptor is closed, then you will also need to mount the remote filesystem with the undocumented -o nocto flag. By default, NFS clients will wait for their writes to complete to disk when a file is closed. This has the effect of ensuring consistency across other systems that may be able to read the file. The nocto flag has the effect of not guaranteeing such consistency, which can improve performance, but has the side effect of hiding changes to that file from other clients until some (unpredictable) later time.
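As a sketch of both adjustments on a Solaris client (the thread count, server, and mount point are only illustrations, and since nocto is undocumented you should verify its behavior on your release):

 * /etc/system (client): raise the per-mount asynchronous thread count
 set nfs:nfs3_max_threads = 16

 # mount -F nfs -o vers=3,nocto server:/export/scratch /scratch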

7.5.2.4 Tuning NFS clients for sequential transfer

The nfs3_nra kernel parameter specifies the number of 32 KB blocks that an NFS client will try to read ahead in a given file. This can significantly improve NFS read throughput by amortizing the cost of an RPC call across more data. One upper bound on nfs3_nra is the number of NFS3 kernel threads (see Section 7.5.2.3 earlier in this chapter).

Another useful kernel parameter is clnt_max_conns. On a given client, all NFS requests that use TCP share a pool of connections governed by clnt_max_conns, which is equal to 1 by default. Increasing this value has the side effect of distributing the processing load across multiple CPUs. Tuning clnt_max_conns is most likely to buy you a performance increase on fast, multiprocessor machines connected via Gigabit Ethernet.
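Like the other client parameters, this is set in /etc/system; a minimal sketch, assuming the parameter is reached through the rpcmod module (the value of 4 is only an illustration):

 * /etc/system (client): open several TCP connections to each NFS server
 set rpcmod:clnt_max_conns = 4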

7.5.3 Tuning Servers

Tuning NFS servers is potentially a complicated issue. There are many questions to be answered:

  • Is the workload data or attribute intensive?

  • Are the clients able to cache the majority of their NFS-mounted data?

  • How many active clients must be supported?

  • How big are the filesystems to be shared?

  • Are the NFS requests directed at the same files repeatedly, or is the access pattern more random?

  • What is the underlying network configuration?

  • Is the underlying disk subsystem fast enough?

  • Is the workload adequately spread out over the available disks?

The single most important factor in tuning an NFS server is ensuring there is sufficient network bandwidth to the client, and the most important factor in determining how much bandwidth is "sufficient" is the type of activity that dominates the workload. Attribute-intensive NFS is easily handled with low-cost networking, [26] but data-intensive NFS demands a high-bandwidth infrastructure. While switched Ethernet is excellent for avoiding congestion in attribute-intensive environments, it does not address the primary problem with data-intensive NFS, which is insufficient bandwidth into the client. Full-duplex operation is considerably less important, because NFS operations tend to be unbalanced (the ratio of reads and writes is nowhere near 1:1). In most NFS installations, full-duplex Ethernet is indistinguishable from half-duplex.

[26] Of course, 100 Mb/s Ethernet is perfectly acceptable for this application. The primary advantage, however, is that more clients can be configured on a single network before overloading occurs.

It is important in data-intensive environments to use NFS Version 3 for the simple reason that the maximum block size in NFS Version 2 is 8 KB. This effectively limits network throughput on a 100 Mb/s connection to 4 MB/s. The advantage of high-speed networks in NFS Version 2 is that multiple full-speed conversations can occur without network degradation. NFS Version 3 improves substantially on the possible throughput, approaching the limits of the underlying network infrastructure, which is mainly due to the much larger data block sizes. This means the network strategy must be quite a bit different, relying instead on dedicated 100 Mb/s Ethernet or very low-density ATM networks.

It is also important to consider how many clients will be active at once. Very few applications present continuous NFS demands. Table 7-17 may be of use in approximating the number of supportable active clients.

Table 7-17. Estimating the number of supportable clients

 Network media           Attribute-intensive clients/network   Data-intensive clients/network
 10BASE-T                15-20                                 Not a good idea
 100BASE-T (Repeated)    150-200                               10-15
 100BASE-T (Switched)    200-280                               15-18
 ATM/FDDI                200-300                               15-20

Processor performance is usually not the limiting factor in NFS server performance. The processor is used primarily to process the network packets themselves and to undertake the work required by the protocol -- generally in that order -- as well as manage the interfaces for the disk and network devices. While IA-32-based systems usually seem to offer attractive price/performance in this arena, they are often hampered by relatively slow memory subsystems. In general, any fast, modern uniprocessor or mid-range multiprocessor system should be able to handle a large amount of attribute-intensive network traffic. A general rule of thumb for data-intensive traffic is that a single 440 MHz UltraSPARC-IIi processor can service about 170 Mb/s. On Solaris and Linux systems that use the in-kernel NFS daemon, scalability is between 75-90%, meaning that adding a second processor increases performance by 75-90%. This is nearly the worst case situation for SMP scalability, because the NFS service is handled entirely within kernel space, where locking is heavily stressed. Secondary caches improve performance by about 10%.

Disk performance is usually critical to NFS performance. After all this discussion of bandwidth, please note that this level of performance is achievable only if the filesystem being served resides on a disk subsystem fast enough to operate at these rates.

7.5.3.1 Designing disk subsystems for NFS servers

Attribute-intensive NFS is nearly all random-access. As a result, the disk read/write heads spend much more time seeking than they do actually retrieving data. The throughput of a disk in a random-access workload is relatively small. Consequently, many disks can be configured on a single SCSI bus, and the goal in configuration should be to provide as many spindles as possible, since they are the limiting factor in performance. For minimal latency, one rule of thumb is to configure one 7,200 rpm disk for every three clients. For attribute-intensive environments with a high amount of writes, using NVRAM in a disk array can substantially reduce disk utilization.

Data-intensive environments are much simpler. As a general guideline, if the network is not a limiting factor, a single active Version 2 client consumes 5.5 MB/s and a Version 3 client consumes 11 MB/s. This means that a stripe or other RAID configuration is essential, because even the fastest disk is unlikely to transfer more than 14 MB/s, and any concurrent use can substantially degrade throughput.

Using logging filesystems for the NFS server can be very useful (see Section 5.4.3), as it provides a way to safely and quickly commit updates to the filesystem and accelerates consistency checking at boot-time. In most cases, however, there is a performance penalty associated with using logging: the operation must first be logged, applied to the filesystem, then cleared from the log. In situations dominated by small writes (up to 16 KB), writing to a logging filesystem is about as fast as writing to a nonlogging filesystem. If many writes are being committed at once, logging is potentially much faster, because the seek distances inside the log are much smaller than the seek distances in the actual filesystem. For large writes, however, logging can impose a performance penalty of up to 20%.

7.5.3.2 NVRAM caching

NVRAM write acceleration substantially improves write throughput in NFS Version 2 environments because all writes in Version 2 are synchronous; they must be committed to nonvolatile storage before the operation can return. [27] This incurs at least three disk writes for a typical file: one to update the actual data, one to update the file's directory information, and one for an indirect block (and quite possibly a doubly indirect block). This means that an NFS Version 2 write can take 60 ms (3 synchronous writes at 20 ms apiece), or three times as long as the normal 20 ms or so for a local disk write. By committing writes to NVRAM instead of disk, write performance is greatly increased. Because this performance boost can be as much as 100%, NVRAM caching should almost always be configured. The only exception to this is in read-only NFS servers, because the NVRAM cache reduces the maximum throughput of the system by about 5% due to the overhead in managing the cache. This is nearly always a fair trade for the vast improvement in response time, if only because the maximum throughput capacity of most systems is much greater than the typical load; it is a classic tradeoff between throughput and latency.

[27] Note that this is not the same as "to disk," and is intentional on the part of the designers of the protocol.

One of the design goals of NFS Version 3 was eliminating synchronous writes. Although NVRAM acceleration is still beneficial, the improvement is much smaller (on the order of 5%). There is no real reason to configure NVRAM caches for servers that only offer NFS Version 3 services.

Nonvolatile memory caching is still critically important for RAID 5 configurations because the disk array has no connection to the filesystem, and as a result it can accelerate more than just synchronous writes (see Section 6.2.6).

7.5.3.3 Memory requirements

The gut reaction of many systems administrators is to configure NFS servers with very large amounts of main memory to try to maximize the benefit of the caching implemented in the virtual memory subsystem. However counterintuitive it may seem, in most cases this is futile, for two reasons. First, the size of the exported filesystem is much larger than the size of main memory; and second, most clients typically do not share most of their on-disk working set with other clients. The one exception to this rule is temporary file space used by client systems that are short of memory.

There is a simple rule for sizing NFS server main memory: provide enough memory to cache any data that is likely to be referred to more than once every five minutes. [28] This can be roughly estimated at 128 MB of main memory per UltraSPARC-class microprocessor. It is worth taking into consideration that attribute-intensive environments benefit a little bit more from having large main memory subsystems than do data-intensive environments. If the NFS server provides temporary file space that is used heavily, configure main memory to be about 75% of the size of the active temporary files. [29] For application servers, main memory should be roughly the size of all heavily used binary files and libraries; this is not a typical use of NFS, and in fact it is very amenable to caching its active data. Because NFS service typically runs entirely within kernel space, the system requires essentially no swap space except for saving crash dumps in the event of a panic. In fact, you can configure a pure NFS server with no swap space at all!

[28] This "five-minute rule" actually originated in sizing main memory for database servers, and is based on an estimation of the cost per second of caching data.

[29] However, keep in mind that redirecting these temporary files to a filesystem local to the client is likely to provide much improved client performance as well as decreased network traffic.

Configuring NFS servers with large amounts of main memory is usually a waste of time.

7.5.3.4 The two basic types of NFS servers

Generally, there are two classes of NFS servers. Some run as user processes, much like sendmail; these are called user-space NFS servers. However, because of the performance improvements associated with having the NFS server tightly bound into the operating system, most modern NFS servers run entirely inside the kernel, like fsflush or bdflush (see Section 4.4.2.1 or Section 4.4.2.2). These are called in-kernel NFS servers. If you are at all concerned with NFS performance, your first step should be to use an in-kernel NFS server; the stock Solaris NFS server is of this type, as is the Linux knfsd server.

One side effect of running in kernel space is that all NFS services are handled at a higher priority than user processes. This means it is a bad idea to run other applications on a high-use NFS server, because the other processes will slow down.

Another side effect is that CPU time spent servicing NFS work is reported in the sys field of the CPU monitoring utilities (see Section 3.6.3). In general, CPU power on an NFS server should be increased if system time is consistently more than 50% of the overall processor utilization.

7.5.3.5 Tuning the number of NFS threads

Each NFS thread can handle a single NFS request at a time. Therefore, a larger pool of threads allows the server to deal with more concurrent NFS requests. The default setting, 16, is almost certainly too small. However, extra NFS threads are not problematic. To find an appropriate number, apply the following rules and take the largest value:

  • Use at least 2 threads for every active client process that is accessing an NFS resource.

  • Use 16 to 384 NFS threads per CPU. Slow systems like the SPARCstation 5 should use 16, and 450 MHz UltraSPARC-II processors should use 384.

  • Use 16 NFS threads for every 10 Mb/s of served networks.
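As a worked example of taking the largest of these values: a server with two 450 MHz UltraSPARC-II processors, two 100 Mb/s networks, and roughly 100 active client processes would need max(2 x 100, 2 x 384, 20 x 16) = 768 threads; the client count and network figures here are, of course, only an illustration.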

You actually adjust this parameter by editing the line in /etc/init.d/nfs.server that starts nfsd. For example, to start 128 threads, use this line:

 /usr/lib/nfs/nfsd -a 128 

The number of NFS server threads should be increased when the number of active threads approaches the number that were started. You can find out approximately how many active NFS server threads there are by running this command: [30]

[30] This command actually counts the number of threads that are run through svc_run . There are other, non-NFS threads that do this, but NFS threads are typically the majority.

 csh#  echo '$<threadlist' | adb -k | grep svc_run | grep -v grep | wc -l
       4

7.5.3.6 Adjusting the buffer cache

The buffer cache is used to cache disk I/O related to inode and indirect blocks only. The size of this cache is determined by the bufhwm variable, which is specified in KB. The default value is zero, which allows up to 2% of system memory to be used. This value can be increased to up to 20%. On a large system, this variable may need to be limited in order to prevent the server from exhausting the kernel address space.

You can monitor the activity of the buffer cache using sar -b :

 %  sar -b 5 10
 ...
 01:22:48 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
 01:23:03       0      71     100      32     103      69       0       0

This system is rather quiet. However, if a significant number of reads and writes occur per second (greater than 50), and the read hit rate (%rcache) and write hit rate (%wcache) fall below 90% and 65%, respectively, then the size of the buffer cache should be increased. You can read more about the buffer cache in Section 5.4.2.8.
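The limit is raised by setting bufhwm in /etc/system and rebooting; a sketch that allows roughly 64 MB (the figure is only an illustration, and should be sized against your sar observations and available kernel address space):

 * /etc/system: allow up to 64 MB (65536 KB) for the buffer cache
 set bufhwm = 65536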

7.5.3.7 The maxusers parameter

The maxusers parameter controls the size of various kernel tables, such as the process table. It is dynamically sized to be equivalent to the amount of configured physical memory in megabytes: [31] any system with more than 1 GB of RAM will have maxusers set to 1,024. If manually set, the minimum is 8 and the maximum is 2,048. The most important thing about this parameter is that it is used to compute the default size for the inode and directory name caches, which are controlled by ufs_ninode and ncsize. By default, these values are set to 17 x maxusers + 90.

[31] This actually excludes the memory taken by the kernel at boot time.

7.5.3.8 The directory name lookup cache (DNLC)

The directory name lookup cache, or DNLC, caches directory lookups. A miss in this cache means that a disk I/O may be required to read the directory when walking the path name components, which is necessary to retrieve a file. The getattr, setattr, and lookup operations, which can represent more than 50% of the total number of NFS calls, all rely on the operation of this cache mechanism. You can find the hit rate of this cache by using vmstat -s:

 %  vmstat -s | grep name
 4300527 total name lookups (cache hits 92%)

In general, if the hit rate is below 90%, the ncsize variable should be tuned. This parameter controls the size of the DNLC in terms of the number of translations between names and locations on disk that can be cached. Each entry uses about 50 bytes, and the only limit on the size of the DNLC is the amount of available kernel memory. Because NFS servers generally do not need much physical memory, some tuning here is usually required. On a dedicated NFS server, doubling the amount specified by the default is reasonable (see Section 7.5.3.7 earlier in this chapter).
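A sketch of such a change in /etc/system, using a value just under the tested limit mentioned in the next section (roughly double the default for a maxusers of 1,024):

 * /etc/system: enlarge the DNLC (17 x 2048 + 90)
 set ncsize = 34906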

7.5.3.9 The inode cache

Every time an operation is performed on a filesystem, the inode read from disk is cached in case it is needed again. The number of idle inodes that are kept in the cache is governed by the ufs_ninode parameter. Each idle inode consumes about 320 bytes of kernel memory. Because every entry in the DNLC points to an entry in the inode cache, the two should be sized together.

Because ufs_ninode is simply a limit, you can change its value on a running system using adb. The tested upper limit corresponds to a value of 2,048 for maxusers, which is equivalent to a value of 34,906 for ncsize. Rather than directly tuning the inode cache, however, tune ncsize and let the system pick the ufs_ninode parameter.
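If you do decide to change the limit on a live system, the adb incantation looks roughly like this (a sketch only; the value is written in decimal via the 0t prefix, and the change is lost at the next reboot):

 # echo 'ufs_ninode/W 0t34906' | adb -kw /dev/ksyms /dev/mem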

7.5.3.10 Observing NFS server performance with nfsstat

One of the most useful utilities for picking up information on how an NFS server is performing is nfsstat -s, which displays some statistics on NFS service:

 #  nfsstat -s
 Server rpc:
 Connection oriented:
 calls       badcalls    nullrecv    badlen      xdrcall     dupchecks
 9035483     0           0           0           0           165117
 dupreqs
 0
 ...
 Version 3: (8616387 calls)
 null        getattr     setattr     lookup      access      readlink
 3651 0%     3446452 39% 25 0%       1509014 17% 2317518 26% 1828 0%
 read        write       create      mkdir       symlink     mknod
 1037342 12% 0 0%        12 0%       1 0%        1 0%        0 0%
 remove      rmdir       rename      link        readdir     readdirplus
 69 0%       0 0%        8 0%        0 0%        102506 1%   165003 1%
 fsstat      fsinfo      pathconf    commit
 1263 0%     1248 0%     30446 0%    0 0%
 ...

There is a lot of data here, but there are a few important things to look out for.

  • A large number in the badcalls field indicates that there is a user in too many groups or there are many attempts to access a nonexported filesystem.

  • If readlink is greater than about 10%, users are using too many symbolic links.

  • If getattr is bigger than about 60%, attributes are probably not being cached successfully on the client. This is adjusted by the actimeo option to mount (see the example following this list).

  • If null is greater than 1%, the automounter timeouts are probably too short.

  • If the number of writes exceeds 5%, it is probably an excellent idea to investigate some sort of nonvolatile caching or filesystem logging on the server.
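As an example of the attribute-cache adjustment mentioned above, lengthening the client's timeouts to 60 seconds might look like this sketch (the server and mount point are placeholders; actimeo takes a value in seconds and sets all four attribute-cache timeouts at once):

 # mount -F nfs -o vers=3,actimeo=60 server:/export/home /home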

7.5.4 Wide Area Networks and NFS

NFS clients and servers are often located on discontiguous networks joined by one or more routers. This brings up a problem that we have not discussed yet: the latency of network links. In local area networking, latency is not generally a problem because the distances are very short, minimizing media delays. Latency is of vital importance because of the request-response nature of most attribute-intensive workloads -- each request must receive a response before the next request can be issued. The latency of wide area networks can be high for three reasons:

  • Wide area networks are much more susceptible to transmission errors, which can cause significant retransmission of data. The time spent transmitting a single packet can be far higher than expected.

  • The physical media used to transfer data over long distances (particularly satellite links) can inject significant latency.

  • Routers take a finite amount of time to route packets from one network to another. Since there are usually several routers between networks, this can add up.

NFS over wide area networks is certainly possible, particularly in data-intensive environments where the ease of use of NFS outweighs the reduced bandwidth. The increased latency causes a substantial performance hit in attribute-intensive installations, but it does not necessarily rule them out. CacheFS can be of greater use in these environments, as can the use of TCP as a transport mechanism.


