Performance Issues with NFS: Bottlenecks and Perceptions | Linux Enterprise Cluster: Build a Highly Available Cluster with Commodity Hardware and Free Software

Performance Issues with NFS—Bottlenecks and Perceptions

Now that we've examined the locking issues involved in building a cluster filesystem that can support existing multiuser applications, we need to examine how well the NLM and NFS perform. The two basic areas of concern when analyzing NFS performance are:

Where are the performance bottlenecks?
How will the end-user perceive the overall NFS performance?

Like an engineer building a suspension bridge, a cluster architect wants to know how NFS will bear the load of the traffic that it will carry. We therefore need to analyze the amount of traffic and know where the stress points or bottlenecks are likely to occur. We also need to know the difference between the raw numbers used to describe the speed of a NAS device and the end-user's perception of how well the system as a whole performs.

We'll first look at the size of the pipes used to send and receive data over the network. As shown in Figure 16-1, the network "pipe" connecting the NAS server to the Ethernet backbone normally supports 1 gigabit per second (Gbps). The figure also shows that the cluster node has two network connections. The first is a 100 Mbps pipe to the network that connects the cluster node's real IP address (RIP) to the Director's IP (DIP), called the D/RIP network in LVS terms. The second connects the cluster node to a network that is dedicated to NFS traffic (labeled the "NFS Network" in Figure 16-1).

Figure 16-1: Ethernet performance bottlenecks

The Gigabit pipe on the NAS server and the switched Ethernet backbone (which can be hundreds of Gbps) for the NFS network are not likely to be the first stress points of performance. In fact, most NAS vendors support multiple Gigabit Ethernet pipes to the Ethernet backbone. Cluster nodes can also be built with multiple Gigabit pipes to the Ethernet backbone, though most multiuser applications running in a cluster environment will not saturate a single 100 Mbps pipe. (Remember that one of the benefits of using a cluster is that more nodes can always be added if too many instances of the same application or service on a single node are competing for access to the 100 Mbps pipe to the Ethernet backbone.)^[14]
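To make the pipe sizes concrete, here is a rough back-of-the-envelope sketch. The link speeds come from the discussion above; the per-node traffic figure of 30 Mbps is purely an assumption for illustration, and the calculation ignores protocol overhead entirely.

```python
import math

# Link speeds from the text: 100 Mbps per cluster node, 1 Gbps on the NAS server.
NODE_LINK_MBPS = 100
NAS_LINK_MBPS = 1000


def nodes_to_saturate(nas_link_mbps, per_node_mbps):
    """Return how many nodes, each sending per_node_mbps, would fill the NAS link."""
    if per_node_mbps <= 0:
        raise ValueError("per_node_mbps must be positive")
    return math.ceil(nas_link_mbps / per_node_mbps)


# Worst case: every node drives its 100 Mbps pipe flat out.
worst_case = nodes_to_saturate(NAS_LINK_MBPS, NODE_LINK_MBPS)

# Hypothetical typical case: each node averages 30 Mbps of NFS traffic.
typical_case = nodes_to_saturate(NAS_LINK_MBPS, 30)

print(worst_case, typical_case)
```

Even in the worst case it takes ten fully loaded nodes to fill a single Gigabit pipe on the NAS server, which is why the network is unlikely to be the first stress point.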

Where, then, is the likely stress point in a shared filesystem? To answer this question, we turn our attention to the end-user's perception of the overall NFS performance.

Note

In the following discussion, the term database refers to flat files or an indexed sequential access method (ISAM) database used by legacy applications. For a discussion of relational and object databases (MySQL, Postgres, ZODB), see Chapter 20.

Single Transactions and User Perception of NFS Performance

Consider, for example, users who enter orders into a system and who need to run reports against this data. Here's what happens: A customer calls, and a customer service agent pulls up the customer's record. The agent then starts a new order and types in information. Up to this point, the difference between clustered NFS disk performance at 100 Mbps and locally attached storage is negligible.

The agent searches for a desired item (product, airline seat, classified ad, and so forth). If the application can find the data the agent is looking for using a key such as a product number or a flight number, the performance hit created by the NFS overhead is negligible because the application does not need to bring a significant amount of information over the network to complete the search.
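A quick calculation shows why a keyed search costs so little over NFS. The sketch below compares the raw transfer time for a keyed lookup against a full scan of the same file on a 100 Mbps pipe. The block size, the number of blocks a keyed lookup touches, and the file size are all assumed values chosen for illustration, and latency and protocol overhead are ignored.

```python
def transfer_seconds(num_bytes, link_mbps=100):
    """Raw time to move num_bytes over a link of link_mbps (no protocol overhead)."""
    bits = num_bytes * 8
    return bits / (link_mbps * 1_000_000)


# Assumption: a keyed lookup reads about three 4 KB blocks (index plus data).
keyed_lookup = transfer_seconds(3 * 4096)

# Assumption: an unkeyed search has to pull a 500 MB flat file across the wire.
full_scan = transfer_seconds(500_000_000)

print(f"keyed lookup: {keyed_lookup:.6f} s, full scan: {full_scan:.1f} s")
```

Under these assumptions the keyed lookup moves data for about a millisecond, while the full scan ties up the pipe for 40 seconds.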

Next, the agent selects the item to modify. Now, a POSIX fcntl lock operation ensures that another agent cannot select the same item or product. As with the keyed search, the NFS locking overhead on a single byte range within a file is not likely to create a noticeable performance difference as far as the agent is concerned.

Once the agent has picked his or her item, the database quantity is updated to reflect the change in inventory, the lock is released, and perhaps a new record is added to another database for the line item in the order. This process is repeated until the order is complete and a final calculation is performed on the order (total amount, tax, and so on). Again, this calculation is likely to be faster on a cluster node than it would be on a monolithic server because more CPU time is available to make the calculation.
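The lock, update, and release sequence described above might look something like the following sketch. It assumes a hypothetical flat-file layout in which each record is a fixed 16-byte field holding the quantity as right-justified ASCII digits; `RECORD_SIZE` and `reserve_item` are names invented for this illustration, not part of any real application. Python's `fcntl.lockf` issues a POSIX fcntl byte-range lock, which is the kind of lock an NFS client forwards to the server.

```python
import fcntl
import os

RECORD_SIZE = 16  # hypothetical fixed-width record: quantity as padded ASCII digits


def reserve_item(path, record_no, count=1):
    """Decrement one record's quantity while holding a lock on just that record."""
    offset = record_no * RECORD_SIZE
    fd = os.open(path, os.O_RDWR)
    try:
        # Exclusive lock on this record's byte range only; blocks until granted.
        fcntl.lockf(fd, fcntl.LOCK_EX, RECORD_SIZE, offset, os.SEEK_SET)
        try:
            raw = os.pread(fd, RECORD_SIZE, offset)
            quantity = int(raw.decode("ascii").strip())
            if quantity < count:
                raise ValueError("insufficient stock")
            new_quantity = quantity - count
            field = str(new_quantity).rjust(RECORD_SIZE).encode("ascii")
            os.pwrite(fd, field, offset)
            os.fsync(fd)  # push the update to stable storage before unlocking
            return new_quantity
        finally:
            # Release the same byte range so other agents can select the item.
            fcntl.lockf(fd, fcntl.LOCK_UN, RECORD_SIZE, offset, os.SEEK_SET)
    finally:
        os.close(fd)
```

Because only one record's byte range is locked, agents working on other items in the same file are not blocked while this update is in flight.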

Given this scenario, or one like it, there is little chance that the extra overhead imposed by NFS will impact the customer service agent's perception of the cluster's performance.^[15] But what about users running reports?

Multiple Transactions and User Perception of NFS Performance

The situation is different for the user running reports or multiple transactions in a batch. This user's perception of the cluster's performance is likely to depend upon how quickly the NAS server can perform I/O operations and how many lock and GETATTR operations are required for each read or write transaction.

Note

Historically, one drawback to using NAS servers (and perhaps a contributing factor for the emergence of storage area networks) was the CPU overhead associated with packetizing and de-packetizing filesystem operations so they could be transmitted over the network. However, as inexpensive CPUs push past the multi-GHz speed barrier, the packetizing and de-packetizing performance overhead fades in importance. And, in a cluster environment, you can simply add more nodes to increase available CPU cycles if more are needed to packetize and de-packetize NFS operations (of course, adding nodes will not, by itself, make NFS any faster).

The speed of your NAS server (the number of I/O operations per second) will probably be determined by your budget. Discussing all of the issues involved with selecting the best NAS hardware is outside the scope of this book, but one of the key performance numbers you'll want to use when making your decision is how many NFS operations the NAS server can perform per second. The NAS server performance is likely to become the most expensive performance bottleneck to fix in your cluster.^[16]

The second performance stress point—the quantity of lock and GETATTR operations—can be managed by careful system administration and good programming practices.
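To see how the two stress points interact, consider a rough estimate of how long a batch report might take. The sketch below uses the figure of roughly 30,000 operations per second mentioned in the footnote for a top-of-the-line NAS server; the record count and the number of operations per record are assumptions made up for illustration. It also treats every lock, unlock, GETATTR, and read as one equally expensive server operation, which is a simplification.

```python
def report_seconds(records, ops_per_record, nas_ops_per_second):
    """Estimate batch run time when the NAS server's operation rate is the limit."""
    return records * ops_per_record / nas_ops_per_second


NAS_OPS_PER_SECOND = 30_000   # figure quoted for a top-of-the-line NAS server
RECORDS = 1_000_000           # assumed size of the report

# Assumption: lock, GETATTR, read, and unlock for every record (4 operations).
per_record_locking = report_seconds(RECORDS, 4, NAS_OPS_PER_SECOND)

# Assumption: locking is hoisted out of the loop, leaving GETATTR plus read.
reduced_locking = report_seconds(RECORDS, 2, NAS_OPS_PER_SECOND)

print(f"{per_record_locking:.0f} s vs {reduced_locking:.0f} s")
```

Under these assumptions, halving the operations issued per record halves the run time of the report without touching the NAS hardware, which is the sense in which this stress point can be managed in software.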

^[14]The cost to add additional nodes may be less than the cost to build multiple Gigabit pipes to each cluster node.

^[15]Unless, of course, the NAS server is severely overloaded. See Chapter 18.

^[16]At the time of this writing, a top-of-the-line NAS server can perform a little over 30,000 I/O operations per second.