13.9 CFS Performance Optimizations

In this section we will discuss various methods for improving the performance of the CFS. Far from a definitive treatise on the subject, this is merely our attempt to give you a few places to look – and perhaps a couple of magic bullets – when performance improvements are sought. As with any discussion around performance optimizations and tuning, your mileage may vary greatly based on the road conditions, tire pressure, cycle of the moon, etc.

Performance optimizations can be sought in the following locations:

  • CFS server load-balancing (section 13.9.1)

  • File system capacity issues (section 13.9.2)

  • I/O transfer size adjustments (section 13.9.3)

  • CFS memory usage adjustments (section 13.9.4)

  • Read-ahead and write-behind thread adjustments (section 13.9.5)

Since the CFS is layered above the UBC and the physical file systems, it is important to optimize these other subsystems to obtain optimal performance for your environment.

For additional information regarding CFS performance optimizations, see the TruCluster Server Cluster Administration Guide and the sys_attrs_cfs(5) reference page. For additional information regarding UBC and file system optimizations, see the Tru64 UNIX File System Administration Handbook and the Tru64 UNIX System Configuration and Tuning manual.

13.9.1 CFS Server Load-Balancing

13.9.1.1 Manual Load Balancing

As we demonstrated in section 13.2.1, the first cluster member booted will likely be the CFS server for a majority of the file systems in the cluster because it will be the first member to mount the file systems on the shared buses. This can become a performance issue, and there is currently no automatic relocation of file systems to load-balance the CFS servers among cluster members.

You can manually relocate the CFS server to another member using the cfsmgr command with the "-a server" option.

For example, let's relocate the tcrhb#fafrak file system from server sheridan to server molari.

 # cfsmgr /fafrak
 Domain or filesystem name = /fafrak
 Server Name = sheridan
 Server Status : OK

 # cfsmgr -a server=molari /fafrak

You can add the "-r" switch to relocate the serving of the underlying disk storage as well, so that the storage remains served locally by the new CFS server. This is only useful if the devices are not Direct-Access I/O (DAIO) capable. See chapter 15 for more information on DAIO devices.
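
For example, had we wanted the storage relocated along with the CFS server, the command might have looked like the following. This is only our illustration of combining the two options; verify the exact usage against the cfsmgr(8) reference page on your system.

 # cfsmgr -r -a server=molari /fafrak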

Let's verify that the relocation was successful using the cfsmgr command.

 # cfsmgr /fafrak
 Domain or filesystem name = /fafrak
 Server Name = molari
 Server Status : OK
Note

Since there is a CFS server per AdvFS domain, relocating a file system that is a fileset of a multi-fileset domain will cause all filesets in the domain to be relocated. For example, if you recall from examples earlier in this chapter, the tcrhb domain contains two filesets: fafrak and lola. If the tcrhb#lola fileset had been mounted at the time we had done the relocation of the /fafrak file system, then the /lola file system would also have been relocated.

 # mount tcrhb#lola /lola 

 # cfs | grep -E "^CFS|^-|tcrhb"
 CFS Server       Mount Point               File System              FS Type
 ---------------- ------------------------- ------------------------ -------
 molari           /fafrak                   tcrhb#fafrak             AdvFS
 molari           /lola                     tcrhb#lola               AdvFS

 # cfsmgr -a server=sheridan /fafrak 

 # cfs | grep -E "^CFS|^-|tcrhb"
 CFS Server       Mount Point               File System              FS Type
 ---------------- ------------------------- ------------------------ -------
 sheridan         /fafrak                   tcrhb#fafrak             AdvFS
 sheridan         /lola                     tcrhb#lola               AdvFS
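
If you are not sure which filesets a domain contains before relocating, the AdvFS showfsets(8) command will list them; for example (output not shown):

 # showfsets tcrhb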

13.9.1.2 Automatic Load Balancing[7]

Although there is no automatic CFS server load-balancing facility, you can use a CFS server relocation script at system startup (or within a CAA application resource action script[8]) to automatically relocate file systems to specific members; a sketch of such a script appears after the list below. The criteria you use to determine which file systems should be served by which member are up to you, but consider the following:

  • Serve the file system or domain from the member that is using it the most.

  • Evenly distribute the file systems or domains among all members.

  • Configure only a subset of the members as CFS servers, leaving other members to be compute servers.

  • Configure file systems or domains on the same disk to the same CFS server.

Although some of these suggestions are mutually exclusive, they are meant to illustrate that there are many options when configuring your cluster. Whichever approach you take, configure the cluster based on its intended use.
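
As a starting point, a relocation script might look something like the following sketch. The file-system-to-member mapping and the idea of checking the current server before relocating are ours; only the cfsmgr syntax comes from the earlier examples.

 #!/sbin/sh
 #
 # Sketch of a CFS relocation script (run at startup or from a CAA action
 # script).  The mapping below is an example only; adjust it for your
 # cluster.  Remember that all mounted filesets in an AdvFS domain
 # relocate together.
 #
 #     /fafrak -> molari
 #
 relocate()
 {
     fs=$1
     member=$2
     current=`cfsmgr $fs | grep "Server Name" | awk '{print $NF}'`
     if [ "$current" != "$member" ]; then
         cfsmgr -a server=$member $fs
     fi
 }

 relocate /fafrak molari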

13.9.1.2.1 How Do You Determine Which Member Is Using the File System the Most?

Use either the cfsmgr command with the "-a statistics" switch or our cfs script with the "-s" switch. With the cfsmgr command, you must run the command once for each member in the cluster to get the statistics for the specified file system. By contrast, our cfs command was written to accomplish this for you. In the example below, we use the cfs command, but we will show you the equivalent cfsmgr commands sans output.

For example, given the file system tcrhb#fafrak, let's see which member is serving the file system and then determine the statistics by member for the file system.

 # cfs -s /fafrak
 /fafrak [tcrhb#fafrak] (dsk6c,dsk5b):
                    read      write     lookup    getattr   readlnk   access   other
                    --------- --------- --------- --------- --------- -------- ---------
        molari:    148568    100000          2          0         0        2       3
 *    sheridan:         0         0          0          0         0        0       0
         total:    148568    100000          2          0         0        2       3

The CFS server is sheridan, as indicated by the "*" in the first column of the output. If there is an "@" in the first column instead, then that member is the CFS server and the file system is a partitioned file system (i.e., it was mounted with the "-o server_only" mount option).

As you can see, the CFS server is not really doing any I/O to the file system, whereas the other member has done a lot more I/O.

Note

The equivalent cfsmgr commands are:

  1. Get the statistics for the first member, molari.

     # cfsmgr -h molari -a statistics /fafrak

  2. Get the statistics for the second member, sheridan.

     # cfsmgr -h sheridan -a statistics /fafrak
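
If you do not have our cfs script handy, a simple loop over the member names produces roughly the same information (the member names below are from our example cluster; substitute your own):

 # for member in molari sheridan
 > do
 >     echo "=== $member ==="
 >     cfsmgr -h $member -a statistics /fafrak
 > done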

13.9.2 File System Capacity Issues

The CFS server grants block reservations to CFS clients so that they are guaranteed available space when it comes time to write their data back to the file system.

Since the available space is reserved, the CFS clients do not have to transfer the data back to the CFS server after every write but instead can cache it locally and transfer the dirty pages back in 64K blocks in the background – potentially even after the file is closed. When a CFS client runs out of reserved space, it will request more from the CFS server. The CFS server will grant additional space, possibly revoking space reservations from other CFS clients if necessary, unless the file system free space gets too low.

A severe performance problem can occur when the free space on the file system falls below 10% or 50MB (whichever is smaller) because the CFS server reserves this space and will no longer grant block reservations to CFS clients. If a CFS client cannot get a block reservation, it must write through to the CFS server to ensure correct ENOSPC error handling, thereby eliminating the performance benefits of caching the data locally. See errno(2) reference page for details on ENOSPC.

The moral of the story? Do not let your file systems get too full.
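
If you want to be warned before this happens, a small script can compare a file system's free space against the 10%/50MB rule. Here is a minimal sketch of ours; the column positions assume the usual "df -k" output layout (total blocks in field 2, available blocks in field 4).

 #!/bin/sh
 #
 # Warn when a file system's free space falls below the smaller of 10% of
 # its size or 50MB (51200 1K blocks), the point at which the CFS server
 # stops granting block reservations to clients.
 #
 fs=${1:-/fafrak}
 set -- `df -k $fs | tail -1`
 total=$2
 avail=$4
 limit=`expr $total / 10`
 [ $limit -gt 51200 ] && limit=51200
 if [ $avail -lt $limit ]; then
     echo "WARNING: $fs has only $avail KB free (CFS threshold is $limit KB)"
 fi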

13.9.3 Do Not Adjust the Block I/O Transfer Size

Chapter 9 of the TruCluster Server Cluster Administration Guide contains a section on tuning the block I/O transfer size. Subsequent to that guide's release, however, we received a note from Compaq's CFS engineers stating that they do not recommend adjusting the block I/O transfer size.

DO NOT ADJUST THE BLOCK I/O TRANSFER SIZE!

Adjusting the block I/O transfer size can cause adverse side effects such as headaches, dizziness, drowsiness, thinking in molasses, and general crankiness.

13.9.4 Adjusting CFS Memory Usage

If you have an application that is reporting an EMFILE error, "too many open files", you might be experiencing one of two memory-related problems:

  1. The member is out of vnodes.

  2. The CFS server has reached svrcfstok_max_percent.

13.9.4.1 Is the CFS Client Out of vnodes?

To see if the CFS client is out of vnodes, you can use your favorite kernel debugger to get the values of the global variables total_vnodes and free_vnodes. Then get the value of the max_vnodes attribute in the vfs subsystem using the sysconfig(8) command.

If total_vnodes is equal to vfs:max_vnodes and free_vnodes is equal to zero, then you should increase the vfs:max_vnodes value.

We wrote a Perl script to get the values and do the math so we wouldn't have to remember the rules or variable names.

 # ./vnode
 If the total (in-use) vnodes is equal to vfs:max_vnodes and the free
 vnodes is equal to zero, then increase vfs:max_vnodes

 vnodes in-use       vfs:max_vnodes     AND       free vnodes = zero
 -------------       --------------               ------------------
     6129       <      18699                          2706

 This member has available vnodes.
 Verify that the CFS server has not exceeded cfs:svrcfstok_max_percent

Using the manual approach:

 # sysconfig -q vfs max_vnodes
 vfs:
 max_vnodes = 18699

 # print 'printf "total_vnodes = %d, \
 > free_vnodes = %d",total_vnodes,free_vnodes;quit' \
 > | dbx -k /vmunix 2> /dev/null | tail -1
 total_vnodes = 6129, free_vnodes = 2706
Note

The max_vnodes variable is a kernel global variable as well and can also be obtained from the kernel debugger.
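
If you would rather script the comparison than remember it, the two commands can be glued together along these lines. This is just a bare-bones shell sketch of the rule stated above, not our vnode script.

 #!/bin/sh
 #
 # Flag this member if it is out of vnodes: total (in-use) vnodes equal
 # to vfs:max_vnodes and free vnodes equal to zero.
 #
 max=`sysconfig -q vfs max_vnodes | grep max_vnodes | awk '{print $3}'`
 set -- `echo 'printf "%d %d",total_vnodes,free_vnodes;quit' | \
         dbx -k /vmunix 2> /dev/null | tail -1`
 total=$1
 free=$2
 if [ $total -eq $max -a $free -eq 0 ]; then
     echo "Out of vnodes: increase vfs:max_vnodes (currently $max)"
 else
     echo "vnodes available ($free free); check cfs:svrcfstok_max_percent next"
 fi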

The max_vnodes attribute can be reconfigured dynamically using the sysconfig command.

 # sysconfig -r vfs max_vnodes=50000 

The minimum, maximum, and default values of the max_vnodes attribute are shown in Table 13-3.

Table 13-3: vfs:max_vnodes attribute values

 Minimum Value   Maximum Value   Default Value
 -------------   -------------   ---------------------------------------------
 0               1717986918      1000, if system main memory is 24MB or less;
                                 otherwise, the number of vnodes that can be
                                 contained in 5% of the system's main memory.

If the CFS client has not exceeded max_vnodes and still has free vnodes available, then you should check to see if the CFS server has reached svrcfstok_max_percent.

13.9.4.2 Has the CFS Server Reached svrcfstok_max_percent?

The CFS server must keep track of all vnodes that are cached on CFS clients – this requires approximately 1600 bytes of system memory for data structures (a token structure, AdvFS access structures, vnode structures, etc.) per cached vnode. The CFS server can use up to cfs:svrcfstok_max_percent of main memory to hold these data structures. The default for the svrcfstok_max_percent attribute is 25% but can be set from 5% to 50%.
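
To put those numbers in perspective: on a member with 512MB of main memory (our example figure), 25% is 128MB, and at roughly 1600 bytes per cached vnode that allows on the order of 80,000 token structures. The cfs_max_svrcfstok value of 76,833 in the example below is in that neighborhood; the exact value depends on how the kernel computes the memory available to it.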

As of this writing, in order to see if the CFS server has reached svrcfstok_max_percent, you need to use a kernel debugger to scope out the values of the svrtok_active_svrcfstok and cfs_max_svrcfstok kernel global variables.

 # print 'printf "svrtok_active_svrcfstok = %d, \
 > cfs_max_svrcfstok = %d",svrtok_active_svrcfstok,cfs_max_svrcfstok;\
 > quit' | dbx -k /vmunix 2> /dev/null | tail -1
 svrtok_active_svrcfstok = 4608, cfs_max_svrcfstok = 76833

If svrtok_active_svrcfstok is greater than or equal to cfs_max_svrcfstok, then you can try one or more of the following suggestions to get your file systems up and running again:

  1. Use the cfsmgr(8) command to relocate some file systems to another cluster member. See section 13.9.1.1 for more information.

  2. Increase the value of the kernel global variable cfs_max_svrcfstok using your favorite kernel debugger.

     # dbx -k /vmunix
     ...
     (dbx) assign cfs_max_svrcfstok=92500
     92500
     (dbx) pd cfs_max_svrcfstok
     92500
     (dbx) quit
  3. Increase the value of cfs:svrcfstok_max_percent in /etc/sysconfigtab using the sysconfigdb(8) or dxkerneltuner(8) command and then reboot the member.

Note

Never ones to shy away from a challenge, we wrote a script called cfssvrtok that checks the values of svrtok_active_svrcfstok and cfs_max_svrcfstok and then offers to modify cfs_max_svrcfstok for you.
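
A bare-bones version of that check, minus the offer to modify cfs_max_svrcfstok, might look like this:

 #!/bin/sh
 #
 # Compare svrtok_active_svrcfstok with cfs_max_svrcfstok on this member.
 #
 set -- `echo 'printf "%d %d",svrtok_active_svrcfstok,cfs_max_svrcfstok;quit' | \
         dbx -k /vmunix 2> /dev/null | tail -1`
 active=$1
 max=$2
 if [ $active -ge $max ]; then
     echo "cfs_max_svrcfstok reached ($active >= $max):"
     echo "relocate some file systems or raise the limit (see above)"
 else
     echo "OK: $active of $max in use"
 fi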

13.9.4.3 Is the CFS Server Running Out of Memory?

If a member acting as a CFS server does not have a large amount of memory, you may want to consider setting the cfs:svrcfstok_max_percent attribute's value to something lower than the 25% default. We do not recommend you do this unless you find that the member is consistently running out of memory and you cannot add additional memory.

13.9.5 Adjusting Read-Ahead and Write-Behind Threads

Sequential access of files on the CFS clients is handled by read-ahead and write-behind threads. By default, a CFS read is done in 64KB increments for remote reads (i.e., reading from the CFS server), so if the CFS detects multiple sequential reads, it uses the read-ahead thread to read the next block of data in anticipation that it too will be requested.

If you are using TruCluster Server version 5.1A, however, the CFS will use Direct Access Cached Reads (see section 13.8), so reads are done directly from the storage devices, bypassing the CFS server and potentially reducing cluster interconnect traffic. Also, with Direct Access Cached Reads, I/Os can generally be larger (the preferred I/O transfer size is 128KB). Note that even with Direct Access Cached Reads, the read-ahead threads are still used.

The write-behind thread is used to write the dirty pages in the background. It attempts to find other contiguous dirty pages from the cache to consolidate the pages into larger I/O transfers to the file system.

These threads are part of the kernel idle task and can be seen by piping the following ps(1) command to the grep(1) command.

 # ps -elm | grep cfsiod_

By default, 32 of these read-ahead and write-behind threads exist on each cluster member. This can be easily verified by piping the previous command string into the wc(1) command.

 # ps -elm | grep cfsiod_ | wc -l
              32

If your CFS client sequentially accesses more than 32 large files at a time, you may want to increase this number to improve performance. A good basic check to see if you need more read-ahead/write-behind threads is to check the state of the waiting threads:

 # ps -emo state,wchan | grep cfsiod_
 S          cfsiod_
 S          cfsiod_
 ...
 S          cfsiod_
 S          cfsiod_
 S          cfsiod_

If a majority of the waiting threads are in an "S" state (i.e., the threads have been sleeping for less than about 20 seconds) and the count of waiting threads seems to fluctuate heavily, you may need to increase the number of read-ahead and write-behind threads.
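
To get a feel for how the count of waiting threads fluctuates, you can sample it every few seconds for a short while (the interval and iteration count below are arbitrary):

 # i=0
 > while [ $i -lt 10 ]
 > do
 >     ps -emo state,wchan | grep cfsiod_ | grep -c "^S"
 >     sleep 5
 >     i=`expr $i + 1`
 > done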

The cfs_async_biod_threads attribute in the cfs subsystem can be used to dynamically modify the number of read-ahead and write-behind threads. The attribute can have any value from 0 to 128. To modify it, you can use the sysconfig command as illustrated in the following example.

Check the initial value of the cfs_async_biod_threads attribute.

 # sysconfig -q cfs cfs_async_biod_threads
 cfs:
 cfs_async_biod_threads = 32

Reconfigure the cfs_async_biod_threads attribute, setting it to 100.

 # sysconfig -r cfs cfs_async_biod_threads=100
 cfs_async_biod_threads: reconfigured

Verify that the reconfiguration succeeded.

 # sysconfig -q cfs cfs_async_biod_threads
 cfs:
 cfs_async_biod_threads = 100

You can also see that the threads have been created.

 # ps -elm | grep cfsiod_ | wc -l
             100
Note

This exercise was performed on a relatively idle cluster member and therefore we were able to illustrate that the number of threads was equal to the value of the cfs_async_biod_threads attribute in the cfs subsystem. It is possible that you could see a smaller number of threads than are actually configured when running:

 # ps -elm | grep cfsiod_ | wc -l 

This will likely indicate that the threads are currently running. The "cfsiod_" string that we are searching for is actually coming from the WCHAN, which is an address (or partial name) of the event on which the thread is waiting. In other words, if the thread is not waiting, its WCHAN will not be "cfsiod_".

According to chapter 9 of the TruCluster Server Cluster Administration Guide, the read-ahead and write-behind threads consume few resources when not in use, so increasing the number of threads will probably not hurt your performance. If you plan to increase the number of threads, however, do it in modest increments and evaluate the effects on your system before jumping to the maximum value. Since the attribute can be modified dynamically, if you notice any adverse effects, you can lower the value without having to reboot your system.

Once you have found an attribute value that fits your configuration, you will need to place it in your system's /etc/sysconfigtab file by using the sysconfigdb or dxkerneltuner command.
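
For example, you could put the attribute in a stanza file and merge it into /etc/sysconfigtab. The stanza file name below is ours, and you should verify the sysconfigdb options against the sysconfigdb(8) reference page on your system.

 # cat cfs_tune.stanza
 cfs:
         cfs_async_biod_threads = 100

 # sysconfigdb -m -f cfs_tune.stanza cfs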

[7]CFS Load Balancing is planned for V5.1B – see the cfsd(8) reference page for details.

[8]For additional information on CAA and action scripts, see chapter 24. See also chapter 21, section 21.4 for an example load-balancing script (cfsldb).



