Section 9.3. Techniques for Improving Performance

   



Remote filesystems provide a challenging performance problem: providing a coherent networkwide view of the data and delivering those data quickly are often conflicting goals. The server can maintain coherency easily by keeping a single repository for the data and sending them out to each client when the clients need them; this approach tends to be slow, because every data access requires the client to wait for an RPC round-trip time. The delay is further aggravated by the huge load that it puts on a server that must service every I/O request from its clients.

To increase performance and to reduce server load, remote filesystem protocols attempt to cache frequently used data on the clients themselves. If the cache is designed properly, the client will be able to satisfy many of its I/O requests directly from the cache. Such accesses are faster than communicating with the server, reducing latency on the client and load on the server and network.

The hard part of client caching is keeping the caches coherent, that is, ensuring that each client quickly replaces any cached data that are modified by writes done on other clients. If a first client writes a file that is later read by a second client, the second client wants to see the data written by the first client, rather than the stale data that were in the file previously. There are two main ways that the stale data may be read accidentally:

  1. If the second client has stale data sitting in its cache, the client may use those data because it does not know that newer data are available.

  2. The first client may have new data sitting in its cache but may not yet have written those data back to the server. Here, even if the second client asks the server for up-to-date data, the server may return the stale data because it does not know that one of its clients has a newer version of the file in that client's cache.

The second of these problems is related to the way that client writing is done. Synchronous writing requires that all writes be pushed through to the server during the write system call. This approach is the most consistent because the server always has the most recently written data. It also permits any write errors, such as "filesystem out of space," to be propagated back to the client process via the write system-call return. With an NFS filesystem using synchronous writing, error returns most closely parallel those from a local filesystem. Unfortunately, this approach restricts the client to only one write per RPC round-trip time.

An alternative to synchronous writing is delayed writing, where the write system call returns as soon as the data are cached on the client; the data are written to the server sometime later. This approach permits client writing to occur at the rate of local storage access up to the size of the local cache. Also, for cases where file truncation or deletion occurs shortly after writing, the write to the server may be avoided entirely because the data have already been deleted. Avoiding the data push saves the client time and reduces load on the server.

There are some drawbacks to delayed writing. To provide full consistency, the server must notify the client when another client wants to read or write the file so that the delayed writes can be written back to the server. There are also problems with the propagation of errors back to the client process that issued the write system call. For example, a semantic change is introduced by delayed-write caching when the file server is full. Here, delayed-write RPC requests can fail with an "out of space" error. If the data are sent back to the server when the file is closed, the error can be detected only if the application checks the return value from the close system call. For delayed writes, written data may not be sent back to the server until after the process that did the write has exited, long after it can be notified of any errors. The only solution is to modify programs writing an important file to do an fsync system call and to check for an error return from that call instead of depending on getting errors from write or close. Finally, there is a risk of the loss of recently written data if the client crashes before the data are written back to the server.

A compromise between synchronous writing and delayed writing is asynchronous writing. The write to the server is started during the write system call, but the write system call returns before the write completes. This approach minimizes the risk of data loss because of a client crash but negates the possibility of reducing server write load by discarding writes when a file is truncated or deleted.
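The trade-offs among the three write policies can be summarized with a toy cost model. This is only a sketch: the round-trip time RTT_MS and both function names are assumptions made for illustration, local copying costs are ignored, and delayed writes are assumed not to be coalesced.

```python
# Toy cost model; RTT_MS and all function names are illustrative assumptions.
RTT_MS = 10  # assumed RPC round-trip time, in milliseconds


def blocked_ms(policy, n_writes):
    """Time the process spends waiting on the server inside write() calls."""
    if policy == "synchronous":
        return n_writes * RTT_MS  # each write waits for the server's reply
    if policy in ("asynchronous", "delayed"):
        return 0  # write() returns before the RPC completes
    raise ValueError(policy)


def write_rpcs(policy, n_writes, discarded_before_flush=False):
    """Number of write RPCs that reach the server."""
    if policy == "delayed" and discarded_before_flush:
        return 0  # file truncated or deleted while the data were still cached
    return n_writes  # synchronous and asynchronous writes are always sent
```

Under these assumptions, only synchronous writing makes the process wait for the server, and only delayed writing can avoid server load when the data are discarded shortly after being written.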

The simplest mechanism for maintaining full cache consistency is the one used by Sprite, which disables all client caching of the file whenever concurrent write sharing might occur [Nelson et al., 1988]. Since NFS has no way of knowing when write sharing might occur, it tries to bound the period of inconsistency by writing the data back when a file is closed. Files that are open for long periods are written back when their oldest dirty data become 30 seconds old. Thus, the NFS implementation does a mix of asynchronous and delayed writing, but it always pushes all writes to the server on close. Pushing the delayed writes on close negates much of the performance advantage of delayed writing because the delays that were avoided in the write system calls are observed in the close system call. With this approach, the server is always aware of all changes made by its clients with a maximum delay of 30 seconds and usually sooner, because most files are open only briefly for writing.

The server maintains read consistency by always having a client verify the contents of its cache before using that cache. When a client reads data, it first checks for the data in its cache. Each cache entry is stamped with an attribute that shows the most recent time that the server says that the data were modified. If the data are found in the cache, the client sends a timestamp RPC request to its server to find out when the data were last modified. If the modification time returned by the server matches that associated with the cache, the client uses the data in its cache; otherwise, it arranges to replace the data in its cache with the new data.
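The verify-before-use rule described above might be sketched as follows. The classes and method names (ToyServer, ToyClient, get_mtime) are hypothetical and are not taken from any NFS implementation; the rpcs counter exists only to make the per-access round trip visible.

```python
class ToyServer:
    """Hypothetical server holding the single authoritative copy of each file."""

    def __init__(self):
        self.files = {}  # name -> (mtime, data)
        self.rpcs = 0    # counts requests, to show the cost of verification

    def store(self, name, data, mtime):
        self.files[name] = (mtime, data)

    def get_mtime(self, name):
        self.rpcs += 1   # the "timestamp RPC" of the text
        return self.files[name][0]

    def fetch(self, name):
        self.rpcs += 1   # a full read RPC
        return self.files[name]


class ToyClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}  # name -> (mtime stamped by the server, data)

    def read(self, name):
        if name in self.cache:
            mtime, data = self.cache[name]
            # Verify the cached copy before using it.
            if self.server.get_mtime(name) == mtime:
                return data
        # Not cached, or stale: replace the cache entry with the new data.
        self.cache[name] = self.server.fetch(name)
        return self.cache[name][1]
```

Even in this small model, every read of cached data still costs one request to the server, which is the problem taken up next.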

The problem with checking with the server on every cache access is that the client still experiences an RPC round-trip delay for each file access, and the server is still inundated with RPC requests, although they are considerably quicker to handle than are full I/O operations. To reduce this client latency and server load, most NFS implementations track how recently the server has been asked about each cache block. The client then uses a tunable parameter that is typically set at a few seconds to delay asking the server about a cache block. If an I/O request finds a cache block and the server has been asked about the validity of that block within the delay period, the client does not ask the server again, but just uses the block. Because certain blocks are used many times in succession, the server will be asked about them only once, rather than on every access. For example, the directory block for the /usr/include directory will be accessed once for each #include in a source file that is being compiled. The drawback to this approach is that changes made by other clients may not be noticed for up to the delay number of seconds.
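A minimal sketch of this delay-based check follows. The 3-second default, the CachedBlock class, and the ask_server_mtime callback are assumptions chosen for illustration; time is passed in explicitly so that the behavior is easy to follow.

```python
ATTR_TIMEOUT = 3  # assumed tunable delay, in seconds


class CachedBlock:
    """Hypothetical cache entry: data, server mtime, and time of last check."""

    def __init__(self, data, mtime, now):
        self.data = data
        self.mtime = mtime
        self.last_checked = now


def read_block(block, now, ask_server_mtime, timeout=ATTR_TIMEOUT):
    """Return (data or None, whether the server was asked).

    None means the block is stale and the caller must refetch it.
    """
    if now - block.last_checked < timeout:
        return block.data, False       # checked recently: trust the cache
    server_mtime = ask_server_mtime()  # one timestamp RPC
    block.last_checked = now
    if server_mtime == block.mtime:
        return block.data, True
    return None, True
```

The window in which a change made by another client can go unnoticed is exactly the interval during which the first branch is taken.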

A more consistent approach used by some network filesystems is to use a callback scheme where the server keeps track of all the files that each of its clients has cached. When a cached file is modified, the server notifies the clients holding that file so that they can purge it from their cache. This algorithm dramatically reduces the number of queries from the client to the server, with the effect of decreasing client I/O latency and server load [Howard et al., 1988]. The drawback is that this approach introduces state into the server because the server must remember the clients that it is serving and the set of files that they have cached. If the server crashes, it must rebuild this state before it can begin running again. Rebuilding the server state is a significant problem when everything is running properly; it gets even more complicated and time-consuming when it is aggravated by network partitions that prevent the server from communicating with some of its clients [Mogul, 1993].
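The essence of a callback scheme can be sketched in a few lines. The classes below are hypothetical and greatly simplified; the holders table is the per-client state that the server must remember and would have to rebuild after a crash.

```python
from collections import defaultdict


class CallbackServer:
    def __init__(self):
        self.data = {}
        self.holders = defaultdict(set)  # file name -> clients caching it

    def fetch(self, client, name):
        self.holders[name].add(client)   # remember who has the file cached
        return self.data.get(name)

    def store(self, writer, name, value):
        self.data[name] = value
        # Call back every other client holding the file so it can purge it.
        for client in self.holders[name] - {writer}:
            client.invalidate(name)
        self.holders[name] = {writer}


class CallbackClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def read(self, name):
        if name not in self.cache:       # no query while the cache is valid
            self.cache[name] = self.server.fetch(self, name)
        return self.cache[name]

    def write(self, name, value):
        self.cache[name] = value
        self.server.store(self, name, value)

    def invalidate(self, name):
        self.cache.pop(name, None)
```

Repeated reads of an unchanged file generate no server traffic at all, which is the source of the reduction in queries.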

The FreeBSD NFS implementation uses asynchronous writes while a file is open but synchronously waits for all data to be written when the file is closed. This approach gains the speed benefit of writing asynchronously, yet ensures that any delayed errors will be reported no later than the point at which the file is closed. The implementation will query the server about the attributes of a file at most once every 3 seconds. This 3-second period reduces network traffic for files accessed frequently, yet ensures that any changes to a file are detected with no more than a 3-second delay. Although these heuristics provide tolerable semantics, they are noticeably imperfect. More consistent semantics at lower cost are available with the NQNFS lease protocol described in the next section.

Leases

The NQNFS protocol is designed to maintain full cache consistency between clients in a crash-tolerant manner. It is an adaptation of the NFS protocol such that the server supports both NFS and NQNFS clients while maintaining full consistency between the server and NQNFS clients. The protocol maintains cache consistency by using short-term leases instead of hard-state information about open files [Gray & Cheriton, 1989]. A lease is a ticket permitting an activity that is valid until some expiration time. As long as a client holds a valid lease, it knows that the server will give it a callback if the file status changes. Once the lease has expired, the client must contact the server if it wants to use the cached data.

Leases are issued using time intervals rather than absolute times to avoid the requirement of time-of-day clock synchronization. There are three important time constants known to the server. The maximum_lease_term sets an upper bound on lease duration, typically 30 seconds to 1 minute. The clock_skew is added to all lease terms on the server to correct for differing clock speeds between the client and server. The write_slack is the number of seconds that the server is willing to wait for a client with an expired write-caching lease to push dirty writes.
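One way to picture how the three constants combine is the sketch below. The numeric values are assumed, and the choice to have the client measure its interval from the moment it sent the request is an assumption of this sketch rather than something stated in the text.

```python
MAXIMUM_LEASE_TERM = 30  # seconds; assumed value within the typical range
CLOCK_SKEW = 3           # seconds; assumed value
WRITE_SLACK = 5          # seconds; assumed value


def client_expiry(request_sent_at, duration):
    """Client side: an interval on the client's own clock (assumed policy:
    measured from when the request was sent, which is conservative)."""
    return request_sent_at + duration


def server_expiry(granted_at, duration, write_caching=False):
    """Server side: cap the term, add clock_skew, and keep a write-caching
    lease valid for write_slack seconds longer than the client was told."""
    duration = min(duration, MAXIMUM_LEASE_TERM)
    expiry = granted_at + duration + CLOCK_SKEW
    if write_caching:
        expiry += WRITE_SLACK
    return expiry
```

Because each side adds an interval to its own clock, neither needs to know the other's time of day; the server simply holds its record somewhat longer than the client holds its lease.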

Contacting the server after the lease has expired is similar to the NFS technique for reducing server load by checking the validity of data only every few seconds. The main difference is that the server tracks its clients' cached files, so there are never periods of time when the client is using stale data. Thus, the time used for leases can be considerably longer than the few seconds that clients are willing to tolerate possibly stale data. The effect of this longer lease time is to reduce the number of server calls almost to the level found in a full callback implementation, such as the Andrew Filesystem [Howard et al., 1988]. Unlike the callback mechanism, state recovery with leases is trivial. The server needs only to wait for the lease's expiration time to pass, and then to resume operation. Once all the leases have expired, the clients will always communicate with the server before using any of their cached data. The lease expiration time is usually shorter than the time it takes most servers to reboot, so the server can effectively resume operation as soon as it is running. If the machine does manage to reboot more quickly than the lease expiration time, then it must wait until all leases have expired before resuming operation.

An additional benefit of using leases rather than hard state information is that leases use much less server memory. If each piece of state requires 64 bytes, a large server with thousands of clients and a peak throughput of 10,000 RPC requests per second will typically only use about 1 Mbyte of memory for leases, with a worst case of about 15 Mbyte. Even if a server has exhausted lease storage, it can simply wait a few seconds for a lease to expire and free up a record. By contrast, a server with hard state must store records for all files currently open by all clients. The memory requirements are 30 to 120 Mbyte of memory per 1000 clients served.

Whenever a client wishes to cache data for a file, it must hold a valid lease. There are three types of leases: noncaching, read caching, and write caching. A noncaching lease requires that all file operations be done synchronously with the server. A read-caching lease allows for client data caching, but no file modifications may be done. A write-caching lease allows for client caching of writes for the period of the lease. If a client has cached write data that are not yet written to the server when a write-cache lease has almost expired, it will attempt to extend the lease. If the extension fails, the client is required to push the written data.

If all the clients of a file are reading it, they will all be granted a read-caching lease. A read-caching lease allows one or more clients to cache data, but they may not make any modifications to the data. Figure 9.4 shows a typical read-caching scenario. The vertical solid black lines depict the lease records. Note that the time lines are not drawn to scale, since a client-server interaction will normally take less than 100 milliseconds, whereas the normal lease duration is 30 seconds. Every lease includes the time that the file was last modified on the server. The client can use this timestamp to ensure that its cached data are still current. Initially, client A gets a read-caching lease for the file. Later, client A renews that lease and uses it to verify that the data in its cache are still valid. Concurrently, client B is able to obtain a read-caching lease for the same file.

Figure 9.4. Read-caching leases. Solid vertical lines represent valid leases.


If a single client wants to write a file and there are no readers of that file, the client will be issued a write-caching lease. A write-caching lease permits delayed write caching but requires that all data be pushed to the server when the lease expires or is terminated by an eviction notice. When a write-caching lease has almost expired, the client will attempt to extend the lease if the file is still open, but it is required to push the delayed writes to the server if renewal fails (see Figure 9.5). The writes may not arrive at the server until after the write lease has expired on the client. A consistency problem is avoided because the server keeps its write lease valid for write_slack seconds longer than the time given in the lease issued to the client. In addition, writes to the file by the lease-holding client cause the lease expiration time to be extended to at least write_slack seconds. This write_slack period is conservatively estimated as the extra time that the client will need to write back any written data that it has cached. If the value selected for write_slack is too short, a write RPC may arrive after the write lease has expired on the server. Although this write RPC will result in another client seeing an inconsistency, that inconsistency is no more problematic than the semantics that NFS normally provides.
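The behavior of both sides near the end of a write-caching lease might be sketched as follows. The function names, the margin parameter (how close to expiry counts as "almost expired"), and the try_renew callback are all hypothetical.

```python
def near_expiry_action(now, expiry, margin, file_open, try_renew):
    """Client side: what to do with a write-caching lease as expiry nears.

    try_renew is a callable that returns True if the server extended
    the lease.
    """
    if expiry - now > margin:
        return "keep caching"
    if file_open and try_renew():
        return "lease extended"
    return "push delayed writes"


def server_expiry_after_write(server_expiry, now, write_slack):
    """Server side: a write from the lease holder pushes the server's
    expiration time out to at least write_slack seconds from now."""
    return max(server_expiry, now + write_slack)
```

The second function is what allows late-arriving writes to be accepted without a consistency problem, provided that write_slack was chosen generously enough.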

Figure 9.5. Write-caching lease. Solid vertical lines represent valid leases.


The server is responsible for maintaining consistency among the NQNFS clients by disabling client caching whenever a server file operation would cause inconsistencies. The possibility of inconsistencies occurs whenever a client has a write-caching lease and any other client or a local operation on the server tries to access the file, or when a modify operation is attempted on a file being read cached by clients. If one of these conditions occurs, then all clients will be issued noncaching leases. With a noncaching lease, all reads and writes will be done through the server, so clients will always get the most recent data. Figure 9.6 shows how read and write leases are replaced by a noncaching lease when there is the potential for write sharing. Initially, the file is read by client A. Later, it is written by client B. While client B is still writing, client A issues another read request. Here, the server sends an "eviction notice" message to client B and then waits for lease termination. Client B writes back its dirty data and then sends a "vacated" message. Finally, the server issues noncaching leases to both clients. In general, lease termination occurs when a "vacated" message has been received from all the clients that have signed the lease or when the lease has expired. The server does not wait for a reply for the message pair "eviction notice" and "vacated," as it does for all other RPC messages. They are sent asynchronously to avoid the server waiting indefinitely for a reply from a dead client.
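The server's choice of lease type can be sketched as a small decision function. This is a simplified illustration with hypothetical names: it ignores lease expiration, local server operations, and the asynchronous "eviction notice" and "vacated" exchange, and simply reports which clients would need to be evicted.

```python
from enum import Enum


class Lease(Enum):
    NONCACHING = "noncaching"
    READ = "read-caching"
    WRITE = "write-caching"


def grant(holders, client, wants_write):
    """Decide the lease for client on one file.

    holders maps each client to its current lease on the file and is
    updated in place. Returns (lease granted, clients to evict).
    """
    others = {c: l for c, l in holders.items() if c != client}
    if not others:
        # Sole user of the file: caching is safe.
        lease, evict = (Lease.WRITE if wants_write else Lease.READ), []
    elif not wants_write and all(l is Lease.READ for l in others.values()):
        # Everyone is only reading: share read-caching leases.
        lease, evict = Lease.READ, []
    else:
        # Potential write sharing: all clients fall back to noncaching.
        evict = [c for c, l in others.items() if l is not Lease.NONCACHING]
        for c in others:
            holders[c] = Lease.NONCACHING
        lease = Lease.NONCACHING
    holders[client] = lease
    return lease, evict
```

Replaying the scenario of Figure 9.6 (client A reads and its lease lapses, client B writes, client A reads again) ends with client B evicted and both clients holding noncaching leases.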

Figure 9.6. Write-sharing leases. Solid vertical lines represent valid leases.


A client gets leases either by doing a specific lease RPC or by including a lease request with another RPC. Most NQNFS RPC requests allow a lease request to be added to them. Combining lease requests with other RPC requests minimizes the amount of extra network traffic. A typical combination can be done when a file is opened. The client must do an RPC to get the handle for the file to be opened. It can combine the lease request, because it knows at the time of the open whether it will need a read or a write lease. All leases are at the granularity of a file, because all NFS RPC requests operate on individual files, and NFS has no intrinsic notion of a file hierarchy. Directories, symbolic links, and file attributes may be read cached but are not write cached. The exception is the file-size attribute that is updated during cached writing on the client to reflect a growing file. Leases have the advantage that they are typically required only at times when other I/O operations occur. Thus, lease requests can almost always be piggybacked on other RPC requests, avoiding some of the overhead associated with the explicit open and close RPC required by a long-term callback implementation.

The server handles operations from local processes and from remote clients that are not using the NQNFS protocol by issuing short-term leases for the duration of each file operation or RPC. For example, a request to create a new file will get a short-term write lease on the directory in which the file is being created. Before that write lease is issued, the server will vacate the read leases of all the NQNFS clients that have cached data for that directory. Because the server gets leases for all non-NQNFS activity, consistency is maintained between the server and NQNFS clients, even when local or NFS clients are modifying the filesystem. The NFS clients will continue to be no more or less consistent with the server than they were without leases.

Crash Recovery

The server must maintain the state of all the current leases held by its clients. The benefit of using short-term leases is that, maximum_lease_term seconds after the server stops issuing leases, it knows that there are no current leases left. As such, server crash recovery does not require any state recovery. After rebooting, the server simply refuses to service any RPC requests except for writes (predominantly from clients that previously held write leases) until write_slack seconds after the final lease would have expired. For machines that cannot calculate the time that they crashed, the final-lease expiration time can be estimated safely as

boot_time + maximum_lease_term + write_slack + clock_skew

Here, boot_time is the time that the kernel began running after the kernel was booted. With a maximum_lease_term of 30 to 60 seconds, and clock_skew and write_slack at most a few seconds, this delay amounts to about 1 minute, which for most systems is taken up with the server rebooting process. When this time has passed, the server will have no outstanding leases. The clients will have had at least write_slack seconds to get written data to the server, so the server should be up-to-date. After this, the server resumes normal operation.
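The recovery rule can be written down directly from the estimate above. The function names and the sample values in the usage note are illustrative assumptions.

```python
def recovery_end(boot_time, maximum_lease_term, write_slack, clock_skew):
    """Safe estimate of when the final lease, plus write slack, has passed."""
    return boot_time + maximum_lease_term + write_slack + clock_skew


def services_request(is_write, now, boot_time,
                     maximum_lease_term, write_slack, clock_skew):
    """During recovery only writes are serviced; afterward, everything is."""
    end = recovery_end(boot_time, maximum_lease_term, write_slack, clock_skew)
    return is_write or now >= end
```

For example, with a boot_time of 1000, a maximum_lease_term of 60 seconds, a write_slack of 5 seconds, and a clock_skew of 3 seconds, the server would refuse nonwrite requests until time 1068.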

There is another failure condition that can occur when the server is congested. In the worst-case scenario, the client pushes dirty writes to the server but a large request queue on the server delays these writes for more than write_slack seconds. In an effort to minimize the effect of these recovery storms, the server replies "try again later" to the RPC requests that it is not yet ready to service [Baker & Ousterhout, 1991]. The server takes two steps to ensure that all clients have been able to write back their written data. First, a write-caching lease is terminated on the server only when there have been no writes to the file during the previous write_slack seconds. Second, the server will not accept any requests other than writes until it has not been overloaded during the previous write_slack seconds. A server is considered overloaded when there are pending RPC requests and all its nfsd processes are busy.
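The overload test and the two safeguards might be sketched as follows. The function names and parameters are hypothetical, and a real server would track these timestamps itself rather than receive them as arguments.

```python
def is_overloaded(pending_requests, busy_nfsd, total_nfsd):
    """Overloaded: requests are pending and every nfsd process is busy."""
    return pending_requests > 0 and busy_nfsd >= total_nfsd


def may_terminate_write_lease(now, last_write_at, write_slack):
    """First safeguard: no writes to the file in the last write_slack seconds."""
    return now - last_write_at >= write_slack


def reply_for(is_write, now, last_overloaded_at, write_slack):
    """Second safeguard: nonwrite requests are deferred until the server has
    gone write_slack seconds without being overloaded."""
    if is_write or now - last_overloaded_at >= write_slack:
        return "service"
    return "try again later"
```

Both safeguards use the same idea: a quiet period of write_slack seconds is taken as evidence that delayed writes are no longer stuck in the request queue.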

Another problem that is solved by short-term leases is how to handle a crashed or partitioned client that holds a lease that the server wishes to vacate. The server detects this problem when it needs to vacate a lease so that it can issue a lease to a second client, and the first client holding the lease fails to respond to the vacate request. Here, the server can simply wait for the first client's lease to expire before issuing the new one to the second client. When the first client reboots or gets reconnected to the server, it simply reacquires any leases it now needs. If a client-to-server network connection is severed just before a write-caching lease expires, the client cannot push the dirty writes to the server. Other clients that can contact the server will continue to be able to access the file and will see the old data. Since the write-caching lease has expired on the client, the client will synchronize with the server as soon as the network connection has been reestablished. This delay can be avoided with a write-through policy.

A detailed comparison of the effects of leases on performance is given in Macklem [1994a]. Briefly, leases are most helpful when a server or network is loaded heavily. Here, leases allow up to 30 to 50 percent more clients to use a network and server before beginning to experience a level of congestion equal to what they would on a network and server that were not using leases. In addition, leases provide better consistency and lower latency for clients, independent of the load.

Although the NQNFS protocol never got adopted outside of the BSD family of operating systems, it did provide a proof of concept that was used to validate the use of leases in the NFS version 4 protocol. In the NFS version 4 protocol, leases are used for both cache consistency and the management of lock state. At the time of publication, there is active work on an implementation of NFS version 4 for BSD that will hopefully be ready for production use in FreeBSD within the next year or two.


   
 


The Design and Implementation of the FreeBSD Operating System
ISBN: 0201702452
EAN: 2147483647
Year: 2003
Pages: 183
