4.4 Consumers of Memory

Memory is consumed by four things: the kernel, filesystem caches, processes, and intimately shared memory. When the system starts, the kernel takes a small amount of memory (generally less than 4 MB) for itself. As it dynamically loads modules and requires additional memory, it claims pages from the free list. These pages are locked in physical memory and cannot be paged out except in the most severe memory shortages. Sometimes, on a system that is very short of memory, you can hear a pop from the speaker: this is the speaker being turned off as the audio device driver is unloaded from the kernel. A module will not be unloaded while a process is actually using the device; otherwise, the disk driver could be paged out, causing difficulties.

Occasionally, a system will experience a kernel memory allocation error. While there is a limit on the size of kernel memory, [7] the problem is caused by the kernel trying to get memory when the free list is completely exhausted. Because the kernel cannot always wait for memory to become available, such requests fail rather than being delayed. One of the subsystems that cannot wait for memory is the streams facility, so if a large number of users try to log into a system at the same time, some logins may fail. Starting with Solaris 2.5.1, changes were made to expand the free list on large systems, which helps prevent the free list from ever being completely empty.

[7] This number is typically very large. On UltraSPARC-based Solaris systems, it is about 3.75 GB.

Processes have private memory to hold their stack space, heap, and data areas. The only way to see how much memory a process is actively using is to use /usr/proc/bin/pmap -x process-id, which is available in Solaris 2.6 and later releases.
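For example, here is a short sketch that totals the memory pmap -x reports for a process. It is not part of the operating system, and it assumes a column layout (which varies slightly between releases) in which the second and third fields of each mapping line are the mapping size and resident size, in kilobytes:

    #!/usr/bin/perl
    # A rough sketch: total the mapped and resident memory a process is
    # using, as reported by pmap -x on Solaris 2.6 and later. Assumes the
    # second and third fields of each mapping line are Kbytes and Resident.
    my $pid = shift or die "usage: $0 process-id\n";
    my ($kbytes, $resident) = (0, 0);
    for (`/usr/proc/bin/pmap -x $pid`) {
        my @f = split;
        next unless @f > 2 && $f[0] =~ /^[0-9A-Fa-f]+$/;   # mapping lines only
        $kbytes   += $f[1];
        $resident += $f[2];
    }
    print "PID $pid: $kbytes KB mapped, $resident KB resident\n";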

Intimately shared memory is a technique by which processes share not only memory pages but also the low-level kernel mapping information about those pages, rather than each maintaining its own copy. This is a significant optimization in that it removes a great deal of redundant mapping information. It is of primary use in database applications such as Oracle, which benefit from having a very large shared memory cache; on Solaris, an application requests it by passing the SHM_SHARE_MMU flag to shmat(2). Three special things are worth noting about intimately shared memory. First, all intimately shared memory is locked and can never be paged out. Second, the memory management structures that are usually created independently for each process are created only once and shared among all processes. Third, the kernel tries to find large (4 MB) pieces of contiguous physical memory that can be used as large pages, which substantially reduces MMU overhead.

4.4.1 Filesystem Caching

The single largest consumer of memory is usually the filesystem-caching mechanism. In order for a process to read from or write to a file, the file needs to be buffered in memory; while the operation is in progress, those pages are locked in memory. After the operation completes, the pages are unlocked and placed at the bottom of the free list. The kernel remembers which pages hold valid cached data, so if the data is needed again, it is readily available in memory, saving the system an expensive trip to disk. When a file is deleted or truncated, or when the kernel decides to stop caching a particular inode, any pages caching that data are placed at the head of the free list for immediate reuse. Most files, however, become uncached only through the action of the page scanner. Data that has been modified in the memory caches is periodically written to disk by fsflush on Solaris and bdflush on Linux, which we'll discuss a little later.

The amount of space used for this behavior is not tunable in Solaris; if you want to cache a large amount of filesystem data in memory, you simply need to buy a system with a lot of physical memory. Furthermore, since Solaris handles all its filesystem I/O by means of the paging mechanism, a large number of observed page-ins and page-outs is completely normal. In the Linux 2.2 kernel, this caching behavior is tunable: only a specific amount of memory is available for filesystem buffering. The min_percent variable controls the minimum percentage of system memory available for caching. The upper bound is not tunable. This variable can be found in the /proc/sys/vm/buffermem file. The format of that file is min_percent max_percent borrow_percent; note that max_percent and borrow_percent are not used.
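If you do want to adjust the floor, you write all three fields back to the file. Here is a minimal sketch for Linux 2.2, run as root; the 10% floor is purely illustrative, and the last two fields, though unused, must still be supplied:

    #!/usr/bin/perl
    # A sketch for Linux 2.2: raise min_percent, the minimum percentage of
    # memory reserved for filesystem buffering, to an illustrative 10%.
    # max_percent and borrow_percent are ignored but must be present.
    open (BUFFERMEM, ">/proc/sys/vm/buffermem")
        or die "cannot open /proc/sys/vm/buffermem: $!";
    print BUFFERMEM "10 60 60\n";
    close (BUFFERMEM);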

4.4.2 Filesystem Cache Writes: fsflush and bdflush

Of course, the caching of files in memory is a huge performance boost; it often allows us to access main memory (a few hundred nanoseconds away) when we would otherwise have to go all the way to disk (tens of milliseconds away). Since the contents of a file can be operated upon in memory via the filesystem cache, it is important for data-reliability purposes to regularly write changed data to disk. Older Unix operating systems, like SunOS 4, wrote the modified contents of memory to disk every 30 seconds. Solaris and Linux both implement a mechanism to spread this workload out, implemented by the fsflush and bdflush processes, respectively.

This mechanism can have a substantial impact on a system's performance; it also explains some otherwise puzzling disk statistics.

4.4.2.1 Solaris: fsflush

The maximum age of any memory-resident modified page is set by the autoup variable, which defaults to 30 seconds; it can safely be increased to several hundred seconds if necessary. Every tune_t_fsflushr seconds (by default, every 5 seconds), fsflush wakes up and checks a fraction of total memory equal to tune_t_fsflushr divided by autoup (that is, by default, five-thirtieths, or one-sixth, of the system's total physical memory). It then flushes any modified entries it finds from the inode cache to disk; this inode flushing can be disabled by setting doiflush to zero. The page-flushing mechanism can be disabled entirely by setting dopageflush to zero, but this can have serious repercussions for data reliability in the event of a crash. Note that dopageflush and doiflush are complementary, not mutually exclusive.
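These variables are normally set in /etc/system and take effect at the next boot. Here is a sketch with purely illustrative values; since each wakeup scans tune_t_fsflushr/autoup of physical memory, this combination scans 5/240, or about 2%, per pass:

    * /etc/system: illustrative fsflush settings, not recommendations.
    * Wake fsflush every 5 seconds, but let modified pages age for up
    * to 240 seconds; each pass then scans 5/240 of physical memory.
    set tune_t_fsflushr=5
    set autoup=240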

4.4.2.2 Linux: bdflush

Linux implements a slightly different mechanism, which is tuned via the values in the /proc/sys/vm/bdflush file. Unfortunately, the tunable behavior of the bdflush daemon has changed significantly from the 2.2 kernels to the 2.4 kernels. I discuss each in turn.

In Linux 2.2, if the percentage of the filesystem buffer cache that is "dirty" (that is, changed and in need of flushing) exceeds bdflush.nfract, then bdflush wakes up. Setting this variable to a high value means that cache flushing can be delayed for quite a while, but it also means that when flushing does occur, a lot of disk I/O happens at once; a lower value spreads disk activity out more evenly. bdflush writes out at most bdflush.ndirty buffer entries at a time; a high value here causes sporadic, bursty I/O, but a value that is too small can lead to a memory shortage, since bdflush isn't woken up frequently enough. The system waits bdflush.age_buffer or bdflush.age_super hundredths of a second before writing a dirty data block or a dirty filesystem metadata block, respectively, to disk. Here's a simple Perl script for displaying, in a pretty format, the values of the bdflush configuration file:

    #!/usr/bin/perl
    # Print the bdflush tunables on a Linux 2.2 system.
    open (BDFLUSH, "/proc/sys/vm/bdflush")
        or die "cannot open /proc/sys/vm/bdflush: $!";
    my ($nfract, $ndirty, $nrefill, $nref_dirt, undef, $age_buffer, $age_super) =
        split (/\s+/, <BDFLUSH>);
    close (BDFLUSH);
    print "Current settings of bdflush kernel variables:\n";
    print "nfract\t\t$nfract\tndirty\t\t$ndirty\tnrefill\t\t$nrefill\n";
    print "nref_dirt\t$nref_dirt\tage_buffer\t$age_buffer\tage_super\t$age_super\n";

In Linux 2.4, about the only thing that didn't change is that bdflush still wakes up if the percentage of the filesystem buffer cache that is dirty exceeds bdflush.nfract. The default value of bdflush.nfract (the first field in the file) is 30%; the range is 0 to 100%. The minimum interval between wakeups and flushes is determined by the bdflush.interval parameter (the fifth field), which is expressed in clock ticks. [8] The default value is 5 seconds; the minimum is 0 and the maximum is 600. The bdflush.age_buffer tunable (the sixth field) governs the maximum amount of time, in clock ticks, that the kernel will wait before flushing a dirty buffer to disk; the default value is 30 seconds, the minimum is 1 second, and the maximum is 6,000 seconds. The final parameter, bdflush.nfract_sync (the seventh field), governs the percentage of the buffer cache that must be dirty before bdflush will activate synchronously; in other words, it is the hard limit past which bdflush forces buffers to disk. The default is 60%. Here's a script to extract values for these bdflush parameters in Linux 2.4:

[8] There are typically 100 clock ticks per second.

    #!/usr/bin/perl
    # Print the bdflush tunables on a Linux 2.4 system.
    open (BDFLUSH, "/proc/sys/vm/bdflush")
        or die "cannot open /proc/sys/vm/bdflush: $!";
    my ($nfract, undef, undef, undef, $interval, $age_buffer, $nfract_sync) =
        split (/\s+/, <BDFLUSH>);
    close (BDFLUSH);
    print "Current settings of bdflush kernel variables:\n";
    print "nfract $nfract\tinterval $interval\tage_buffer $age_buffer\t",
        "nfract_sync $nfract_sync\n";

If the system has a very large amount of physical memory, fsflush and bdflush (we'll refer to them generically as flushing daemons) have a lot of work to do every time they wake up. However, most files that would have been written out by the flushing daemon have already been closed by the time they're marked for flushing, and writes over NFS are always performed synchronously, so the flushing daemon isn't needed for them. In cases where the system performs lots of I/O without using direct I/O or synchronous writes, the performance of the flushing daemons becomes important. A general rule for Solaris systems is that if fsflush has consumed more than five percent of the system's cumulative nonidle processor time, autoup should be increased.
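The sketch below estimates that percentage by comparing the cumulative CPU time of fsflush (process ID 3 on Solaris) against the time elapsed since it started. It assumes a ps that accepts the standard time and etime format specifiers, and it approximates cumulative nonidle processor time with wall-clock time, so treat the result as a rough guide rather than a precise measurement:

    #!/usr/bin/perl
    # A rough sketch: estimate what fraction of its lifetime fsflush
    # (PID 3 on Solaris) has spent on the CPU.

    sub to_seconds {
        # Convert ps's [[dd-]hh:]mm:ss time format into seconds.
        my ($t) = @_;
        $t =~ s/^\s+//;
        my $days = ($t =~ s/^(\d+)-//) ? $1 : 0;
        my @f = reverse split (/:/, $t);   # seconds, minutes, hours
        return $days * 86400 + $f[0] + 60 * ($f[1] || 0) + 3600 * ($f[2] || 0);
    }

    chomp (my $cpu     = `ps -o time= -p 3`);
    chomp (my $elapsed = `ps -o etime= -p 3`);
    my $pct = 100 * to_seconds ($cpu) / to_seconds ($elapsed);
    printf "fsflush has consumed %.2f%% of its elapsed time\n", $pct;
    print "Consider increasing autoup.\n" if $pct > 5;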

4.4.3 Interactions Between the Filesystem Cache and Memory

Because Solaris has an untunable filesystem caching mechanism, it can encounter problems in certain circumstances. The source of the problem is that the kernel allows the filesystem cache to grow to the point where it begins to steal memory pages from user applications. This behavior not only shortchanges other potential consumers of memory, but means that filesystem performance becomes dominated by the rate at which the virtual memory subsystem can free memory.

There are two solutions to this problem: priority paging and the cyclic cache.

4.4.3.1 Priority paging

In order to address this issue, Sun introduced a new paging algorithm in Solaris 7, called priority paging, which places a boundary around the filesystem cache. [9] A new kernel variable, cachefree, scales with minfree, desfree, and lotsfree: the system attempts to keep cachefree pages of memory available, but frees filesystem cache pages only when the size of the free list is between cachefree and lotsfree.

[9] The algorithm was later backported to Solaris 2.6.

The effect is generally excellent. Desktop systems and online transaction processing (OLTP) environments tend to feel much more responsive, and much of the swap-device activity is eliminated; computational workloads that do a great deal of filesystem writing may see as much as a 300% performance increase. By default, priority paging is disabled until sufficient end-user feedback on its performance has been gathered; it will likely become the standard algorithm in future Solaris releases. In order to use this mechanism, you need Solaris 7, or 2.6 with kernel patch 105181-09. To enable the algorithm, set the priority_paging variable to 1. You can also implement the change on a live 32-bit system by setting the cachefree tunable to twice the value of lotsfree.
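To enable it persistently, set the variable in /etc/system (a minimal sketch; again, this requires Solaris 7, or 2.6 with kernel patch 105181-09):

    * /etc/system: enable priority paging at the next boot.
    set priority_paging=1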

4.4.3.2 Cyclic caching

A more technically elegant solution to this problem was implemented in Solaris 8, primarily through the efforts of Richard McDougall, a senior engineer at Sun Microsystems. No special procedures need be followed to enable it. At the heart of this mechanism is a straightforward rule: nondirty pages that are not mapped anywhere should be on the free list. This rule means that the free list now contains all the filesystem cache pages, which has far-reaching consequences:

  • Application startup (or other heavy memory consumption in a short period of time) can occur much faster, because the page scanner is not required to wake up and free memory.

  • Filesystem I/O has very little impact on other applications on the system.

  • Paging activity is reduced to zero, and the page scanner stays idle, when sufficient memory is installed.

As a result, analyzing a Solaris 8 system for a memory shortage is simple: if the page scanner reclaims any pages at all, there is a memory shortage. The mere activity of the page scanner means that memory is tight.

4.4.4 Interactions Between the Filesystem Cache and Disk

When data is being pushed out from memory to disk via fsflush, Solaris tries to gather modified pages that are adjacent to each other on disk, so that they can be written out in one continuous piece. The maximum size of such a clustered write is governed by the maxphys kernel parameter. Set this parameter to a reasonably large value; 1,048,576 (1 MB) is a good choice, and the largest value that makes sense for a modern UFS filesystem. As we'll discuss in Section 6.6 in Chapter 6, a maxphys value of 1,048,576 with a 64 KB interlace size is sufficient to drive a 16-disk RAID 0 array at nearly full speed with a single file.
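Like the other kernel tunables discussed in this chapter, maxphys is set in /etc/system; a minimal sketch:

    * /etc/system: allow clustered physical I/O of up to 1 MB.
    set maxphys=1048576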

There is another case where memory and disk interact to give suboptimal performance. If your applications constantly write and rewrite files that are cached in memory, the in-memory filesystem cache is very effective, but the flushing daemon regularly purges that data out to disk, which may not be what you want. For example, if your working set is 40 GB and fits entirely in available memory, then with the default autoup value of 30, fsflush attempts to synchronize up to 40 GB of data to disk every 30 seconds. Most disk subsystems cannot sustain 1.3 GB/second, so the application ends up throttled, waiting for disk I/O to complete, despite the fact that its entire working set is in memory!

There are three telltale signs for this case:

  • vmstat -p shows very low filesystem activity.

  • iostat -xtc shows constant disk write activity.

  • The application has a high wait time for file operations.

Increasing autoup (to, say, 840) and tune_t_fsflushr (to 120) decreases the amount of data sent to disk per interval, improving the chances of issuing a single larger I/O rather than many smaller ones. You also improve your chances of seeing write cancellation, in which not every modification to a file is written to disk. The flip side is that you run a higher risk of losing data if the server fails.
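In /etc/system, that change would look like this (a sketch using the values suggested above; choose values that match your own tolerance for data loss):

    * /etc/system: let modified pages age for up to 840 seconds, and
    * wake fsflush only every 120 seconds.
    set autoup=840
    set tune_t_fsflushr=120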


