15.4 Managing Memory

Memory resources have at least as much effect on overall system performance as the distribution of CPU resources. To perform well, a system needs to have adequate memory not just for the largest jobs it will run, but also for the overall mix of jobs typical of its everyday use. For example, the amount of memory that is sufficient for the one or two big jobs that run overnight might provide only a mediocre response time under the heavy daytime interactive use. On the other hand, an amount of memory that supports a system's normal interactive use might result in quite poor performance when larger jobs are run. Thus, both sets of needs should be taken into consideration when planning for and evaluating system memory requirements.

Paging and swapping are the means by which Unix distributes available memory among current processes when their total memory needs exceed the amount of physical memory. Technically, swapping refers to writing an entire process to disk, thereby freeing all of the physical memory it had occupied. A swapped-out process must then be reread into memory when execution resumes. Paging involves moving sections of a process's memory in units called pages to disk, to free up physical memory needed by some process. A page fault occurs when a process needs a page of memory that is not resident and must be (re)read in from disk. On virtual memory systems, true swapping occurs rarely if at all^[18] and usually indicates a serious memory shortage, so the two terms are used synonymously by most people.

^[18] Some systems swap out idle processes to free memory. The swapping I refer to here is the forced swapping of active processes due to a memory shortage.

Despite the strong negative connotations the term has acquired, paging is not always a bad thing. In the most general sense, paging is what makes virtual memory possible, allowing a process' memory requirements to greatly exceed the actual amount of physical memory. A process' total memory requirement includes the sum of the size of its executable image^[19] (known as its text segment) and the amount of memory it uses for data.

^[19] An exception occurs for executables that can be partially or totally shared by more than one process. In this case, only one copy of the image is in memory regardless of how many processes are executing it. The total memory used by the shared portions in these cases is divided among all processes using them in the output from commands like ps.

To run on systems without virtual memory, the process requires an amount of physical memory equal to its current text and data requirements. Virtual memory systems take advantage of the fact that most of this memory isn't actually needed all the time. Pieces of the process image on disk are read in only as needed. The system automatically maps their virtual addresses (relative addresses with respect to the beginning of the process's image) to real physical memory locations. When the process accesses a part of its executable image or its data that is not currently in physical memory, the kernel pages in what is needed from disk, sometimes replacing other pages that the process no longer needs.

For a large program that spends most of its time in two routines, for example, only the part of its executable image containing the routines needs to be in memory while they are running, freeing up for other uses the memory the rest of the program's text segment would occupy on a nonvirtual memory computer. This is true whether the two routines are close together or far apart in the process' virtual address space. Similarly, if a program uses a very large data area, all of it needn't be resident in memory simultaneously if the program doesn't access it all at once. On many modern systems, program execution also always begins with a page fault as the operating system takes advantage of the kernel's virtual memory management facility to read enough of the executable image to get it started.

The problem with paging comes when there is not enough physical memory on the system for all of the processes currently running. In this case, the kernel will apportion the total memory among them dynamically. When a process needs a new page read in and there are no free or reusable pages, the operating system must steal a page that is being used by some other process. In this case, an existing page in memory is paged out. For volatile data, this results in the page being written to a paging area on disk; for executable pages or unmodified pages read in from file, the page is simply freed. In either case, however, when that page is again required, it must be paged back in, possibly forcing out another page.

When available physical memory is low, an appreciable portion of the available CPU time can be spent handling page faulting, and all processes will execute much less efficiently. In the worst kind of such thrashing conditions, the system spends all of its time managing virtual memory, and no real work gets done at all (no CPU cycles are actually used to advance the execution of any process). Accordingly, total CPU usage can remain low under these conditions.

You might think that changing the execution priorities for some of the jobs would solve a thrashing problem. Unfortunately, this isn't always the case. For example, consider two large processes on a system with only a modest amount of physical memory. If the jobs have the same execution priority, they will probably cause each other to page continuously if they run at the same time. This is a case where swapping is actually preferable to paging. If one job is swapped out, the other might run without page faulting, and after some amount of time, the situation can be reversed. Both jobs finish much sooner this way than they do under continuous paging.

Logically, lowering the priority of one of the jobs should cause it to wait to execute until the other one pauses (e.g., for an I/O operation) or completes. However, except for the special, low-priority levels we considered earlier, low-priority processes do occasionally get some execution time even when higher-priority processes are runnable. This happens to prevent a low-priority process from monopolizing a critical resource and thereby creating an overall system bottleneck or deadlock (this concern is indicative of a scheduling algorithm designed for lots of small interactive jobs). Thus, running both jobs at once, regardless of their priorities, will result in some execution degradation (even for the higher priority job) due to paging. In such cases, you need to either buy more memory or not run both jobs at the same time.
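The two-job scenario can be made concrete with a toy simulation. The following is only a sketch under simplifying assumptions: least-recently-used (LRU) page replacement, two jobs that each sweep repeatedly through a fixed working set, and invented sizes for memory and the working sets. Real virtual memory managers use more elaborate policies, so treat the numbers as illustrative rather than predictive.

```python
from collections import OrderedDict


def count_faults(references, frames):
    """Count page faults for a reference string under LRU replacement."""
    resident = OrderedDict()  # resident pages, least recently used first
    faults = 0
    for page in references:
        if page in resident:
            resident.move_to_end(page)  # hit: now the most recently used
            continue
        faults += 1
        if len(resident) >= frames:
            resident.popitem(last=False)  # evict the least recently used page
        resident[page] = None
    return faults


def job(name, pages, passes):
    """A job that sweeps through its whole working set `passes` times."""
    return [(name, p) for _ in range(passes) for p in range(pages)]


def interleave(a, b, quantum):
    """Alternate between two jobs, `quantum` references at a time."""
    out = []
    for i in range(0, max(len(a), len(b)), quantum):
        out.extend(a[i:i + quantum])
        out.extend(b[i:i + quantum])
    return out


FRAMES = 100       # physical memory in pages (hypothetical)
WORKING_SET = 80   # pages each job touches per pass (hypothetical)
PASSES = 10

a = job("A", WORKING_SET, PASSES)
b = job("B", WORKING_SET, PASSES)

together = count_faults(interleave(a, b, quantum=WORKING_SET), FRAMES)
one_at_a_time = count_faults(a + b, FRAMES)

print("faults when run together:     ", together)
print("faults when run one at a time:", one_at_a_time)
```

Either job fits in memory by itself, but the two together do not. In this model, running them concurrently makes every memory reference a page fault, while running them one after the other costs only the initial faults needed to load each working set.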

In fact, the virtual memory managers in modern operating systems work very hard to prevent such situations from occurring by employing a number of techniques that use memory efficiently. They also try to keep a certain amount of memory free at all times to minimize the risk of thrashing. These are some of the most common practices used to maximize the efficiency of the system's memory resources:

Demand paging: Pages are loaded into memory only when a page fault occurs. When a page is read in, a few pages surrounding the faulted page are typically loaded as well in the same I/O operation in an effort to head off future page faults.
Copy-on-write page protection: Whenever possible, only a single copy of identical pages in use by multiple processes is kept in memory. Duplicate, process-private copies of a page are created only if one of the processes modifies it.
Page reclaims: When memory is short, the virtual memory manager takes memory pages being used by current processes. However, such pages are simply initially marked as free and are not replaced with new data until the last possible moment. In this way, the owning process can reclaim them without a disk read operation if their original contents are still in memory when the pages are required again.

The next section discusses commands you can use to monitor memory use and paging activity on your system and get a picture of how well the system is performing. Later sections discuss managing the system paging areas.

15.4.1 Monitoring Memory Use and Paging Activity

The vmstat command is the best tool for monitoring system memory use; it is available on all of the systems we are considering. The most important statistics in this context are the number of running processes and the number of page-outs^[20] and swaps. You can use this information to determine whether the system is paging excessively. As you gather data with these commands, you'll also need to run the ps command so that you know what programs are causing the memory behavior you're seeing.

^[20] Because of the way that AIX keeps its paging statistics, page-ins are better indicators, because a page-in always means that a page was previously paged out.

The following sections discuss the memory monitoring commands and show how to interpret their output. They provide examples of output from systems under heavy loads. It's important to keep in mind, though, that all systems from time to time have memory shortages and consequent increases in paging activity. Thus, you can expect to see similar output on your system periodically. Such activity is significant only if it is persistent. Some deviation from what is normal for your system is to be expected, but consistent and sustained paging activity does indicate a memory shortage that you'll need to deal with.

15.4.1.1 Determining the amount of physical memory

The following commands can be used to quickly determine the amount of physical memory on a system:

AIX	`lsattr -HE -l sys0 -a realmem`
FreeBSD	`grep memory /var/run/dmesg.boot`
HP-UX	`dmesg | grep Phys`
Linux	`free`
Solaris	`dmesg | grep mem`
Tru64	`vmstat -P | grep '^Total'`

Some Unix versions (including FreeBSD, AIX, Solaris, and Tru64) also support the pagesize command, which displays the size of a memory page:

$ pagesize
4096

Typical values are 4 KB and 8 KB.
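When the pagesize command is not available, or when the value is needed inside a script, the same figure can usually be obtained programmatically. This sketch assumes a POSIX system on which Python's os.sysconf recognizes the SC_PAGE_SIZE name; the helper names are invented for illustration.

```python
import os


def page_size_bytes():
    """Return the system memory page size in bytes (POSIX systems)."""
    return os.sysconf("SC_PAGE_SIZE")


def pages_to_mb(pages, page_size=None):
    """Convert a page count to megabytes (1 MB = 1024 * 1024 bytes)."""
    if page_size is None:
        page_size = page_size_bytes()
    return pages * page_size / (1024 * 1024)


print("page size:", page_size_bytes(), "bytes")
print("1016 pages of 4 KB each:", pages_to_mb(1016, page_size=4096), "MB")
```

A conversion helper like this is handy because many of the statistics and tuning parameters discussed in this section are expressed in pages rather than bytes.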

15.4.1.2 Monitoring memory use

Overall memory usage levels are very useful indicators of the general state of the virtual memory subsystem. They can be obtained from many sources, including the top command we considered earlier. Here is the relevant part of the output:

CPU states:  3.5% user,  9.4% system, 13.0% nice, 87.0% idle
Mem:   63212K av,  62440K used,    772K free,  21924K shrd,    316K buff
Swap:  98748K av,   6060K used,  92688K free                  2612K cached

Graphical system state monitors can also provide overall memory use data. Figure 15-2 illustrates the KDE System Guard (ksysguard) utility's display. It presents both a graphical view of ongoing CPU and memory usage, as well as the current numerical data in the status area at the bottom of the window.

Figure 15-2. Overall system performance statistics

Linux also provides the free command, which lists current memory usage statistics:

$ free -m -o
        total      used    free   shared    buffers    cached
Mem:      249       231      18        0         11        75
Swap:     255         2     252

The command's options specify display units of MB (-m) and to omit buffer cache data (-o).

The most detailed memory subsystem data is given by vmstat. As we've seen, vmstat provides a number of statistics about current CPU and memory use. vmstat output varies somewhat between implementations. Here is an example of typical vmstat output:^[21]

^[21] vmstat's output varies somewhat from system to system, as we'll see.

$ vmstat 5 4
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s6 s7 s8   in   sy   cs us sy id
 0 0 0 1642648 759600 98 257 212 10 10 0 0  0  0  1  4  199  121   92  8  3 88
 0 0 0 1484544 695816 0   1  0  0  0  0  0  0  0  0  0  113   35   46  0  1 99
 0 0 0 1484544 695816 0   0  0  0  0  0  0  0  0  0  0  113   65   45  0  1 99
 0 0 0 1484544 695816 0   0  0  0  0  0  0  0  0  0  0  111   72   44  0  1 99

The first line of every vmstat report is an average since boot time; it can be ignored for our purposes, and I'll be omitting it from future displays.^[22]

^[22] You can define an alias to take care of this automatically. Here's an example for the C shell:

alias vm "/usr/bin/vmstat \!:* | awk 'NR!=4'"

The report is organized into sections as follows:

procs or kthr: Statistics about active processes. Together, the first three columns tell you how many processes are currently active.
memory: Memory use and availability data.
page or swap: Paging activity.
io or disk: Per-device I/O operations.
faults or system or intr: Overall system interrupt and context switching rates.
cpu: Percentage of CPU devoted to system time, user time, and time the CPU remained idle. AIX adds an additional column showing CPU time spent in idle mode while jobs are waiting for pending I/O operations.

Not all versions of vmstat contain all sections.

Table 15-5 lists the most important columns in vmstat's report.

Table 15-5. vmstat report contents
Label(s)	Meaning
r	Number of runnable processes.
b	Number of blocked processes (idle because they are waiting for I/O).
w	Number of swapped-out runnable processes (should be 0).
avm, act, swpd	Number of active virtual memory pages (a snapshot at the current instant). For `vmstat`, a page is usually 1 KB, regardless of the system's actual page size. However, under AIX and HP-UX, a `vmstat` page is 4 KB.
fre, free	Number of memory pages on the free list.
re	Number of page reclaims: pages placed on the free list but reclaimed by their owner before the page was actually reused.
pi, si, pin	Number of pages paged in (usually includes process startup).
po, so, pout	Number of pages paged out (if greater than zero, the system is paging).
fr	Memory pages freed by the virtual memory management facility during this interval.
dn	Disk operations per second on disk n. Sometimes, the columns are named for the various disk devices rather than in this generic way (e.g., adn under FreeBSD). Not all versions of `vmstat` include disk data.
cs	Number of context switches.
us	Percentage of total CPU time spent on user processes.
sy	Percentage of total CPU time spent as system overhead.
id	Idle time percentage (percentage of CPU time not used).

Here are examples of the output format for each of our systems:

AIX

 kthr     memory             page              faults        cpu
----- ------------- ----------------------- ------------ ---------
 r  b    avm    fre re  pi  po  fr  sr  cy  in   sy   cs us sy id wa
 0  0 149367 847219  0   0   0   0   0   0 109  258   11 18  7 72  3

HP-UX

     procs           memory          page                          faults       cpu
r  b  w      avm    free   re  at  pi  po   fr   de   sr     in  sy    cs  us sy id
2  0  0   228488  120499    1   0   0   0   10    0   0    1021  44    29  14  1 86

Linux

      procs                   memory    swap          io    system          cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
1  0  0      0   4280   5960  48296   0   0     5     1  101   123   1   0  99

FreeBSD

procs      memory      page                    disks     faults      cpu
r b w     avm    fre  flt  re  pi  po  fr  sr ad0 ad1   in   sy  cs us sy id
0 0 0    5392  32500    1   0   0   0   1   0   0   0  229    9   3  0  1 99

Solaris

kthr      memory            page            disk          faults      cpu
r b w   swap  free  re  mf pi po fr de sr dd f0 s0 --   in   sy   cs us sy id
0 0 0 695496 187920  0   1  1  0  0  0  1  0  0  0  0  402   34   45  0  0 100

Tru64

Virtual Memory Statistics: (pagesize = 8192)
procs      memory        pages                            intr       cpu
r   w   u  act free wire fault  cow zero react  pin pout  in  sy  cs us sy id
3 135  31  15K  10K 5439  110M   8M  52M  637K  42M  63K   4 953  1K  2  0 98

Note that some versions have additional columns.

We'll look at interpreting vmstat output in the next subsection.
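Because the column labels differ from system to system, scripts that process vmstat output are easier to maintain if they look values up by label rather than by position. The following is a minimal sketch of that idea, not a complete parser. It assumes that the label line is the first line beginning with r, that repeated labels (such as the two sy columns) can be told apart by suffixing the later ones, and that the first data line is the since-boot average. The sample text is abbreviated and the function names are invented.

```python
def unique(labels):
    """Make repeated labels distinct: the second "sy" becomes "sy_2"."""
    seen = {}
    out = []
    for label in labels:
        seen[label] = seen.get(label, 0) + 1
        out.append(label if seen[label] == 1 else f"{label}_{seen[label]}")
    return out


def to_number(field):
    """Convert plain integers; leave values such as "15K" as strings."""
    try:
        return int(field)
    except ValueError:
        return field


def parse_vmstat(text, skip_first=True):
    """Turn vmstat output into a list of {label: value} dictionaries."""
    labels = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if labels is None:
            # Assumption: the label line is the first one starting with "r".
            if fields[0] == "r":
                labels = unique(fields)
            continue
        if len(fields) != len(labels) or not fields[0].isdigit():
            continue  # repeated headers, separator lines, "^C", and so on
        rows.append(dict(zip(labels, (to_number(f) for f in fields))))
    # The first data line is an average since boot, so drop it by default.
    return rows[1:] if skip_first else rows


SAMPLE = """\
 procs     memory       page       faults      cpu
 r b w   swap  free  re pi po   in   sy   cs us sy id
 0 0 0 1642648 759600 98 212 10  199  121   92  8  3 88
 0 0 0 1484544 695816  0   0  0  113   35   46  0  1 99
"""

for row in parse_vmstat(SAMPLE):
    print("free:", row["free"], "po:", row["po"], "cpu sy:", row["sy_2"])
```

In this scheme the first sy is the system call rate from the faults section and sy_2 is the CPU system-time percentage, which matches the left-to-right order of the sections in the formats shown here.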

15.4.1.3 Recognizing memory problems

You can expect memory usage to vary quite a lot in the course of normal system operations. Short-term memory usage spikes are normal and to be expected. In general, one or more of the following symptoms may suggest a significant shortage of memory resources when they appear regularly and/or persist for a significant period of time:

Available memory drops below some acceptable threshold. On an interactive system this may be 5%-15%. However, on a system designed for computation, a steady free memory amount of 5% may be fine.
Significant paging activity. The most significant metrics in this case are writes to the page file (page-outs) and reads from the page file (although most systems don't provide the latter statistic).
The system regularly thrashes, even if only for short periods of time.
The page file gradually increases in size or remains at a high usage level under normal operations. This can indicate that additional paging space is needed or that memory itself is in low supply.

In practical terms, let's consider specific parts of the vmstat output:

In general, the number in the w column should be 0, indicating no runnable swapped-out processes; if it isn't, the system has a serious memory shortage.
The po column is the most important in terms of paging: it indicates the number of page-outs and should ideally be very close to zero. If it isn't, processes are contending for the available memory and the system is paging. Paging activity is also reflected in significant decreases in the amount of free memory (fre) and in the number of page reclaims (re), that is, memory pages taken away from one process because another one needs them even though the first process needs them too.
High numbers in the page-ins column (pi) are not always significant because starting up a process involves paging in its executable image and data.^[23] When a new process starts, this column will jump up but then quickly level off again.

^[23] The AIX version of vmstat limits pi to page-ins from paging space.

The following is an example of the effect mentioned in the final bullet:

$ vmstat 5                              Output is edited.
procs        memory         page
 r b w     avm    fre  re    pi  po
 0 1 0   81152  17864   0     0   0
 1 1 0   98496  15624   0   192   0
 2 0 0   84160  11648   0   320   0
 2 0 0   74784   9600   0   320   0
 2 0 0   74464   5984   0    64   0
 2 0 0   78688   5472   0     0   0
 1 1 0   60480  16032   0     0   0
 ^C

At the second data line, a compile job starts executing. There is a jump in the number of page-ins, and the available memory (fre) drops sharply. Once the job gets going, the page-ins drop back to zero, although the free list size stays small. When the job ends, its memory returns to the free list (final line). Check your system's documentation to determine whether process startup paging is included in vmstat's paging data.

Here is some output from a system briefly under distress:

$ vmstat 5                              Some columns omitted.
procs        memory         page    ...    cpu
r b w     avm    fre  re    pi  po       us sy id
1 1 0   43232  31296   0     0   0        3  0 97
1 2 0   46560  32512   0     0   0        5  0 95
5 0 0   82496   2848   2   384 608        5 37 58
2 3 0   81568   2304   2   384 448        4 63 43
4 1 0   72480   2144   0    96  96        6 71 23
5 1 0   72640   2112   0    64  32       12 76 12
4 1 0   73280   3328   0     0   0       23 26 51
2 1 0   54176  19552   0    32   0       34  1 65
^C

At the beginning of this report, this system was running well, with no paging activity at all. Then several new processes start up (the third data line), both page-in and page-out activity increases, and the free list shrinks. This system doesn't have enough memory for all the jobs that want to run at this point, which is also reflected in the size of the free list. By the end of this report, however, things are beginning to calm down again as these processes finish.
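The distinction between a brief spike and persistent paging can be checked mechanically once the samples are in hand. The sketch below applies two of the rules of thumb from this section: page-outs that stay above zero for several intervals in a row, and any nonzero value in the w column. The three-interval cutoff and the function names are assumptions chosen for illustration; what counts as persistent depends on what is normal for your system.

```python
def longest_run(flags):
    """Length of the longest run of consecutive true values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def sustained_paging(samples, min_samples=3):
    """True if page-outs (po) are nonzero for min_samples intervals in a row."""
    return longest_run(s["po"] > 0 for s in samples) >= min_samples


def swapped_out_runnable(samples):
    """True if any interval shows swapped-out runnable processes (w > 0)."""
    return any(s["w"] > 0 for s in samples)


# (w, fre, pi, po) for each interval of the distressed system's report above
DISTRESS = [
    (0, 31296, 0, 0), (0, 32512, 0, 0), (0, 2848, 384, 608), (0, 2304, 384, 448),
    (0, 2144, 96, 96), (0, 2112, 64, 32), (0, 3328, 0, 0), (0, 19552, 32, 0),
]
samples = [dict(zip(("w", "fre", "pi", "po"), row)) for row in DISTRESS]

print("sustained paging:    ", sustained_paging(samples))
print("swapped-out runnable:", swapped_out_runnable(samples))
```

For this report the page-out test fires, since po is nonzero for four consecutive intervals, while the w test does not. A check like this is only a first filter; the ps output for the same period is still needed to identify the processes responsible.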

15.4.1.4 The filesystem cache

Most current Unix implementations use any free memory as a data cache for disk I/O operations in an effort to maximize I/O performance. Recently accessed data is kept in memory for a time in case it is needed again, as long as there is sufficient memory to do so. However, this is the first memory to be freed if more memory is needed. This tactic improves the performance of local processes and network system access operations. However, on systems designed for computation, such memory may be better used for user jobs.

On many systems, you can configure the amount of memory that is used in this way, as we'll see.

15.4.2 Configuring the Virtual Memory Manager

Some Unix variations allow you to specify some of the parameters that control the way the virtual memory manager operates. We consider each Unix version individually in the sections that follow.

These operations require care and thought and should be initially tried on nonproduction systems. Recklessness and carelessness will be punished.

15.4.2.1 AIX

AIX provides commands for customizing some aspects of the Virtual Memory Manager. You need to be cautious when modifying any of the system parameters discussed in this section, because it is quite possible to make the system unusable or even crash if you give invalid values. Fortunately, changes made with the commands in the section last only until the system is rebooted.

AIX's schedtune command (introduced in the previous section of this chapter) can be used to set the values of various Virtual Memory Manager (VMM) parameters that control how the VMM responds to thrashing conditions. In general, its goal is to detect such conditions and deal with them before they get completely out of hand (for example, a temporary spike in memory usage can result in thrashing for many minutes if nothing is done about it).

The VMM decides that the system is thrashing when the fraction of page steals (pages grabbed while they were still in use) that are actually paged out to disk^[24] exceeds some threshold value. When this happens, the VMM begins suspending processes until thrashing stops.^[25] It tries to select processes to suspend that are both having an effect on memory performance and whose absence will actually cause conditions to improve. It chooses processes based on their own repage rates: when the fraction of a process's page faults that are for pages that have been previously paged out rises above a certain value (by default, one fourth), that process becomes a candidate for suspension. Suspended processes are resumed once system conditions have improved and remained stable for a certain period of time (by default, 1 second).

^[24] Computed as po/fr, using the vmstat display fields.

^[25] Suspended processes still consume memory, but they stop paging.

Without any arguments, schedtune displays the current values of all of the parameters under its control, including those related to memory load management. Here is an example of its output:

# schedtune
     THRASH           SUSP       FORK             SCHED
-h    -p    -m      -w    -e      -f       -d       -r        -t       -s
SYS  PROC  MULTI   WAIT  GRACE   TICKS   SCHED_D  SCHED_R  TIMESLICE MAXSPIN
 0     4     2       1     2      10       16       16         1      16384
   CLOCK    SCHED_FIFO2   IDLE MIGRATION   FIXED_PRI
    -c          -a            -b              -F
%usDELTA   AFFINITY_LIM  BARRIER/16       GLOBAL
  100           7             4               0

Table 15-6 summarizes the meanings of the thrashing-related parameters.

Table 15-6. AIX VMM parameters
Option	Label	Meaning
`-h`	SYS	Memory is defined as overcommitted when page writes/total page steals > 1/`-h`. Setting this value to 0 disables the thrash recovery mechanisms (which is the default).
`-p`	PROC	A process may be suspended during thrashing conditions when its repages/page faults > 1/`-p`. This parameter defines when an individual process is thrashing. The default is 4.
`-m`	MULTI	Minimum number of processes to remain running even when the system is thrashing. The default is 2.
`-w`	WAIT	Number of seconds to wait after thrashing ends (as defined by `-h`) before reactivating any suspended processes. The default is 1.
`-e`	GRACE	Number of seconds after reactivation before a process may be suspended again. The default is 2.

Currently, the AIX thrashing recovery mechanisms are disabled by default. In general, it is better to prevent memory overuse problems than to recover from them. However, this is not always possible, so you may find this feature useful on very busy, heavily loaded systems. To enable it, set the value of -h to 6 (the previous AIX default value).
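The two ratio tests controlled by -h and -p can be written out explicitly. The following sketch simply restates the criteria from Table 15-6 as functions; it is not AIX code, and the function and argument names are invented. The counts would come from the po and fr fields of vmstat for the system-wide test and from per-process statistics for the second.

```python
def system_thrashing(page_outs, page_steals, h):
    """Memory is overcommitted when page_outs / page_steals > 1/h.

    A value of 0 for h disables the test, as it does for schedtune -h.
    """
    if h == 0 or page_steals == 0:
        return False
    # Same comparison as page_outs / page_steals > 1 / h, without division.
    return page_outs * h > page_steals


def suspension_candidate(repages, page_faults, p=4):
    """A process may be suspended when repages / page_faults > 1/p."""
    if p == 0 or page_faults == 0:
        return False
    return repages * p > page_faults


# Hypothetical counts for one interval, tested against the former default of 6
print(system_thrashing(page_outs=20, page_steals=100, h=6))
print(suspension_candidate(repages=30, page_faults=100))
```

Larger values of either parameter lower the corresponding threshold: with -h set to 6, memory is considered overcommitted once more than one sixth of the stolen pages are being written out.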

For most systems, it is not necessary to change the default values of the other thrashing control parameters. However, if you have clear evidence that the VMM is systematically behaving either too aggressively or not aggressively enough in deciding whether memory has become overcommitted, you might want to experiment with small changes, beginning with -h or -p. In some cases, increasing the value of -w may be beneficial on systems running a large number of processes. I don't recommend changing the value of -m.

The vmtune command allows the system administrator to customize some aspects of the behavior of the VMM's page replacement algorithm. vmtune is located in the same directory as schedtune: /usr/samples/kernel. Without options, the command displays the values of various memory management parameters:

# vmtune
vmtune:  current values:
  -p       -P        -r          -R         -f       -F       -N        -W
minperm  maxperm  minpgahead maxpgahead  minfree  maxfree  pd_npages maxrandwrt
 209507   838028       2          8        120      128     524288        0
  -M      -w      -k      -c        -b         -B           -u        -l    -d
maxpin npswarn npskill numclust numfsbufs hd_pbuf_cnt lvm_bufcnt lrubucket defps
838849    4096    1024       1     196        192          9      131072     1
        -s              -n         -S         -L          -g           -h
sync_release_ilock  nokilluid  v_pinshm  lgpg_regions  lgpg_size  strict_maxperm
        0               0           0           0            0        0
  -t
maxclient
 838028
number of valid memory pages = 1048561  maxperm=79.9% of real memory
maximum pinable=80.0% of real memory    minperm=20.0% of real memory
number of file memory pages = 42582     numperm=4.1% of real memory
number of compressed memory pages = 0   compressed=0.0% of real memory
number of client memory pages = 46950   numclient=4.5% of real memory
# of remote pgs sched-pageout = 0       maxclient=79.9% of real memory

These are vmtune's most useful options for memory management:

-f minfree: Minimum size of the free list, a set of memory pages set aside for new pages required by processes (used to satisfy page faults). When the free list falls below this threshold, the VMM must steal pages from running processes to replenish the free list. The default is 120 pages.
-F maxfree: Page stealing stops when the free list reaches or exceeds this size. The default is 128 pages.
-p minperm: Threshold value that forces both computational and file pages to be stolen (expressed as a percentage of the system's total physical memory). The default is 18%-20% (depending on memory size).
-P maxperm: Threshold value that forces only file pages to be stolen (expressed as a percentage of the system's total physical memory). The default is 75%-80%.

The second pair of parameters determine to a certain extent which sorts of memory pages are stolen when the free list needs to be replenished. AIX distinguishes between computational memory pages, which consist of program working storage (non-file-based data) and program text segments (the executable's in-memory image). File pages are all other kinds of memory pages (all of which are backed by disk files). By default, the VMM attempts to slightly favor computational pages over file pages when selecting pages to steal, according to the following scheme:

Both types: %file < minperm OR file-repaging ≥ computational-repaging
File pages only: (minperm < %file < maxperm AND file-repaging < computational-repaging) OR %file > maxperm

%file is the percentage of pages which are file pages. Repage rates are the fraction of page faults that reference stolen or replaced memory pages rather than new pages (determined from the VMM's limited history of pages that have recently been present in memory). It may make sense to reduce maxperm on computationally-oriented systems.
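The selection scheme can be expressed as a small decision function. This is a sketch of the rules as summarized here, not of the VMM's actual implementation. It assumes that the maxperm and minperm tests take precedence over the repage comparison, that the in-between case with file repaging at least as high as computational repaging means both types are stolen, and that values exactly equal to a threshold fall into the in-between case. The default thresholds are the percentages shown in the vmtune output above, rounded.

```python
def pages_to_steal(pct_file, file_repage, comp_repage, minperm=20.0, maxperm=80.0):
    """Return "file" or "both" according to the page-stealing scheme.

    pct_file is the percentage of memory pages that are file pages; the
    repage arguments are the two repage rates, in the same units.
    """
    if pct_file > maxperm:
        return "file"   # too many file pages: steal only those
    if pct_file < minperm:
        return "both"   # few file pages: steal either type
    # Between the thresholds, the repage rates decide.
    return "file" if file_repage < comp_repage else "both"


print(pages_to_steal(4.1, 0.10, 0.20))    # numperm from the vmtune display
print(pages_to_steal(50.0, 0.10, 0.20))
print(pages_to_steal(50.0, 0.30, 0.20))
```

Seen this way, lowering maxperm widens the range in which only file pages are stolen, which is why doing so can help on systems where computational pages matter most.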

15.4.2.2 FreeBSD

On FreeBSD systems, kernel variables may be displayed and modified with the sysctl command (and set at boot time via its configuration file /etc/sysctl.conf). For example, the following commands display and then reduce the value for the maximum number of simultaneous processes allowed per user:

# sysctl kern.maxprocperuid
kern.maxprocperuid: 531
# sysctl kern.maxprocperuid=64
kern.maxprocperuid: 531 -> 64

Such a step might make sense on systems where users need to be prevented from overusing/abusing system resources (although, in itself, this step would not solve such a problem).

The following line in /etc/sysctl.conf performs the same function:

kern.maxprocperuid=64

Figure 15-3 lists the kernel variables related to paging activity and the interrelationships among them.

Figure 15-3. FreeBSD memory management levels

Normally, the memory manager tries to maintain at least vm.v_free_target free pages. The pageout daemon, which suspends processes when memory is short, wakes up when free memory drops below the level specified by vm.v_free_reserved (it sleeps otherwise). When it runs, it tries to achieve the total number of free pages specified by vm.v_inactive_target.

The default values of these parameters depend on the amount of physical memory in the system. On a 98 MB system, they have the following settings:

vm.v_inactive_target: 1524             Units are pages.
vm.v_free_target: 1016
vm.v_free_min: 226
vm.v_free_reserved: 112
vm.v_pageout_free_min: 34

Finally, the variables vm.v_cache_min and vm.v_cache_max specify the minimum and maximum sizes of the filesystem buffer cache (the defaults are 1016 and 2032 pages, respectively, on a 98 MB system). The cache can grow dynamically between these limits if free memory permits. If the cache size falls significantly below the minimum size, the pageout daemon is awakened. You may decide to increase one or both of these values if you want to favor the cache over user processes in memory allocation. Increase the maximum first; changing the minimum level requires great care and understanding of the memory manager internals.

15.4.2.3 HP-UX

On HP-UX systems, kernel parameters are set with the kmtune command.

Paging is controlled by three variables, in the following way:

free memory ≥ lotsfree: Page stealing stops.
desfree ≤ free memory < lotsfree: Page stealing occurs.
minfree ≤ free memory < desfree: Anti-thrashing measures taken, including process deactivation (in addition to page stealing).

The default values for these variables are set by HP-UX and depend on the amount of physical memory in the system (in pages). The documentation strongly discourages modifying them.
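The relationship among the three thresholds amounts to a simple classification of the current free memory level. The sketch below assumes that lotsfree, desfree, and minfree are all expressed in pages, that each range includes its lower bound, and that free memory below minfree is handled like the lowest range. The numeric values are invented, since the real defaults depend on the system's memory size.

```python
def hpux_paging_state(free_pages, lotsfree, desfree):
    """Describe paging behavior for a given amount of free memory (in pages)."""
    if free_pages >= lotsfree:
        return "no page stealing"
    if free_pages >= desfree:
        return "page stealing"
    # Assumption: anything below desfree, including below minfree,
    # triggers the anti-thrashing measures as well.
    return "page stealing and anti-thrashing measures"


# Hypothetical thresholds, in pages
LOTSFREE, DESFREE = 8192, 2048

for free in (12000, 5000, 1500):
    print(free, "->", hpux_paging_state(free, LOTSFREE, DESFREE))
```

The same three-level pattern (stop stealing, steal pages, take emergency measures) appears on several of the systems discussed in this section, with different names for the thresholds.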

HP-UX can use either a statically or dynamically sized buffer cache (the latter is the default and is recommended). A dynamic cache is used when the variables nbuf and bufpages are both set to 0. In this case, you can specify the minimum and maximum percentage of memory used for the cache via the variables dbc_min_pct and dbc_max_pct, which default to 5% and 50%, respectively. Depending on the extent to which you want to favor the cache or user processes in allocating memory, modifying the maximum value may make sense.

15.4.2.4 Linux

On Linux systems, modifying kernel parameters is done by changing the values within files in /proc/sys and its subdirectories (as we've seen previously). For memory management, the relevant files are located in the vm subdirectory. These are the most important of them:

freepages

This file contains three values specifying a minimum free page level, a low free page level, and a desired free page level. When there are fewer than the minimum number, user processes are denied additional memory. Between the minimum and low levels, aggressive paging (page stealing) takes place, while between the low and desired levels, "gentle" paging occurs. Above the desired (highest) level, page stealing stops.

The default values (in pages) depend on the amount of physical memory in the system, but they scale as x, 2x, and 3x (more or less). Successfully modifying these values requires a thorough knowledge of both the Linux memory subsystem and the system workload, and doing so is not recommended for the faint of heart.

buffermem

Specifies the amount of memory to be used for the filesystem buffer cache. The three values specify the minimum amount, the borrow percentage, and the maximum amount. They default to 2%, 10%, and 60%, respectively. When memory is short and the size of the buffer cache exceeds the borrow percentage level, pages will be stolen from the buffer cache until its size drops below this level.

If you want to favor the buffer cache over processes in allocating memory, increasing the borrow and/or maximum levels may make sense. On the other hand, if you want to favor processes, reducing the maximum and setting the borrow level close to it makes more sense.

overcommit_memory

Setting the value in this file to 1 allows processes to allocate amounts of memory larger than can actually be accommodated (the default is 0). Some application programs allocate huge amounts of memory that they never actually use, and they may run successfully if this setting is enabled.

Changing parameter values is accomplished by modifying the values in these files. For example, the following command changes the settings related to the buffer cache:

# echo "5  33  80" > /proc/sys/vm/buffermem
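Because a mistyped value takes effect immediately, it can be worth checking the three numbers before writing them. The following is a minimal sketch: check_buffermem is a hypothetical helper, and the rule it enforces (whole percentages no larger than 100, with minimum <= borrow <= maximum) is my assumption about sensible values rather than a documented kernel requirement.

```shell
#!/bin/sh
# Hypothetical pre-flight check for a "min borrow max" triple intended for
# /proc/sys/vm/buffermem. The ordering rule is an assumption, not a
# documented kernel requirement.
check_buffermem() {
    min=$1 borrow=$2 max=$3
    for v in "$min" "$borrow" "$max"; do
        case $v in
            ''|*[!0-9]*) echo "invalid"; return 1 ;;   # not a whole number
        esac
        [ "$v" -le 100 ] || { echo "invalid"; return 1; }
    done
    if [ "$min" -le "$borrow" ] && [ "$borrow" -le "$max" ]; then
        echo "ok"
    else
        echo "invalid"; return 1
    fi
}

# Only write the new values if they pass the check (root required):
# check_buffermem 5 33 80 && echo "5  33  80" > /proc/sys/vm/buffermem
check_buffermem 5 33 80
```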

15.4.2.5 Solaris

On Solaris systems, you can view the values of system parameters via the kstat command. For example, the following command displays system parameters related to paging behavior, including their default values on a system with 1 GB of physical memory:

# kstat -m unix -n system_pages | grep 'free '
     cachefree      1966              Units are pages.
     lotsfree       1966
     desfree         983
     minfree         491
     ...

Figure 15-4 illustrates the meanings and interrelationships of these memory levels.

Figure 15-4. Solaris paging and swapping memory levels

As the figure indicates, setting cachefree to a value greater than lotsfree provides a way of favoring processes' memory over the buffer cache (by default, no distinction is made between them because lotsfree is equal to cachefree). In order to do so, you should decrease lotsfree to some point between its current level and desfree (rather than increasing cachefree).

Solaris 9 has changed its virtual memory manager and has eliminated the cachefree variable.

15.4.2.6 Tru64

Tru64 memory management is controlled by parameters in the sysconfig vm subsystem. These are the most useful parameters:

vm_aggressive_swap: Enable/disable aggressive swapping out of idle processes (0 by default). Enabling this can provide some memory management improvements on heavily loaded systems, but it is not a substitute for reducing excess consumption.

There are several parameters that control the conditions under which the memory manager steals pages from active processes and/or swaps out idle processes in an effort to maintain sufficient free memory. They are listed in Figure 15-5 along with their interrelationships and effects.

Figure 15-5. Tru64 paging and swapping memory levels

The default for vm_page_free_min is 20 pages. The value of vm_page_free_target varies with the memory size; for a system with 1 GB of physical memory, it defaults to 512 pages. The reserved value is always 10 pages.

The other variables are computed from these values. vm_page_free_swap (and the equivalent vm_page_free_optimal) is set to the point halfway between the minimum and the target, and vm_page_free_hardswap is set to about 16 times the target value.

Several parameters relate to the size of the buffer cache. vm_minpercent specifies the percentage of memory initially used for the buffer cache (the default is 10%). The buffer cache size will increase if memory is available. The parameter ubc_maxpercent specifies the maximum amount of memory that it may use (the default is 100%). When memory is short and the size of the cache corresponds to ubc_borrowpercent or larger, pages will be returned to the general pool until the cache drops below this level (and process memory page stealing does not occur). The default for the borrow level is 20% of physical memory.

On file servers, it will often make sense to increase one or both of the minimum and borrow percentages (to favor the cache over local processes in memory allocation). On a database server, though, you will probably want to reduce these sizes.
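Returning to the free-page thresholds, the two computed values described above can be worked out with a few lines of shell arithmetic. This is a sketch: tru64_free_levels is a hypothetical helper, and it treats "about 16 times" as exactly 16, so the real kernel values may differ slightly.

```shell
#!/bin/sh
# Hypothetical helper: derive the computed Tru64 free-page thresholds
# from vm_page_free_min and vm_page_free_target (values in pages).
tru64_free_levels() {
    min=$1 target=$2
    swap=$(( (min + target) / 2 ))   # halfway between minimum and target
    hardswap=$(( target * 16 ))      # "about 16 times" the target
    echo "vm_page_free_swap=$swap vm_page_free_hardswap=$hardswap"
}

# Defaults quoted above for a 1 GB system: minimum 20, target 512.
tru64_free_levels 20 512
```

With those defaults, the halfway point is 266 pages and the hard swap level is 8192 pages.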

15.4.3 Managing Paging Space

Specially designated areas of disk are used for paging. On most Unix systems, distinct, dedicated disk partitions called swap partitions are used to hold pages written out from memory. In some recent Unix implementations, paging can also go to special page files stored in a regular Unix filesystem.^[26]

^[26] Despite their names, both swap partitions and page files can be used for paging and for swapping (on systems supporting virtual memory).

NOTE

Many discussions of setting up paging space advise using multiple paging areas, spread across different physical disk drives. Paging I/O performance will generally improve the closer you come to this ideal.

However, regular disk I/O also benefits from careful disk placement. It is not always possible to separate both paging space and important filesystems. Before you decide which to do, you must determine which kind of I/O you want to favor and then provide the improvements appropriate for that kind.

In my experience, paging I/O is best avoided rather than optimized, and other kinds of disk I/O deserve far more attention than paging space placement.

15.4.3.1 How much paging space?

There are as many answers to this question as there are people to ask. The correct answer is, of course, "It depends." What it depends on is the type of jobs your system typically executes. A single-user workstation might find a paging area of one to two times the size of physical memory adequate if all the system is used for is editing and small compilations. On the other hand, real production environments running programs with very large memory requirements might need two or even three times the amount of physical memory. Keep in mind that some processes will be killed if all available paging space is ever exhausted (and new processes will not be able to start).

One factor that can have a large effect on paging space requirements is the way that the operating system assigns paging space to virtual memory pages implicitly created when programs allocate large amounts of memory (which may not all be needed in any individual run). Many recent systems don't allocate paging space for such pages until each page is actually accessed; this practice tends to minimize per-process memory requirements and stretch a given amount of physical memory as far as possible. However, other systems assign paging space to the entire block of memory as soon as it is allocated. Obviously, under the latter scheme, the system will need more page file space than under the former.

Other factors that will tend to increase your page file space needs include:

Jobs requiring large amounts of memory, especially if the system must run more than one at a time.
Jobs with virtual address spaces significantly larger than the amount of physical memory.
Programs that are themselves very large (i.e., have large executables). This often implies the item above, but not vice versa.
A very, very large number of simultaneously running jobs, even if each individual job is fairly small.

15.4.3.2 Listing paging areas

Most systems provide commands to determine the locations of paging areas and how much of the total space is currently in use:

	List paging areas	Show current usage
AIX	`lsps -a`	`lsps -a`
FreeBSD	`pstat -s`	`pstat -s`
HP-UX	`swapinfo -t -a -m`	`swapinfo -t -a -m`
Linux	`cat /proc/swaps`	`swapon -s` or `free -m -o`
Solaris	`swap -l`	`swap -l` or `swap -s`
Tru64	`swapon -s`	`swapon -s`

Here is some output from a Solaris system:

swapfile             dev   swaplo blocks   free
/dev/dsk/c0t0d0s1   136,1      16 1049312  1049312

The Solaris swap command also has a -s option, which lists statistics about current overall paging space usage:

total: 22240k bytes allocated + 6728k reserved = 28968k used,             691568k available

Under AIX, the command to list the paging space information is lsps -a:

$ lsps -a
Page Space   Phys. Volume   Volume Group   Size    %Used   Active   Auto
hd6          hdisk0         rootvg         200MB   76      yes      yes
paging00     hdisk3         uservg         128MB   34      yes      yes

The output lists the paging space name, the physical disk it resides on, the volume group it is part of, its size, how much of it is currently in use, whether it is currently active, and whether it is activated automatically at boot time. This system has two paging spaces totaling about 328 MB; total system swap space is currently about 60% full.
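The total size and overall usage figures can be derived from the listing with a short awk filter. This is a sketch only: lsps_summary is a hypothetical helper, and it assumes the column layout shown in this example (size in the fourth field with an MB suffix, percentage used in the fifth).

```shell
#!/bin/sh
# Hypothetical helper: total size and overall percentage used, computed from
# "lsps -a"-style output on standard input. Assumes Size is field 4 (with an
# MB suffix) and %Used is field 5; the first line is treated as the header.
lsps_summary() {
    awk 'NR > 1 {
        size = $4; sub(/MB$/, "", size)    # "200MB" -> 200
        total += size
        used  += size * $5 / 100           # MB in use in this paging space
    }
    END { if (total > 0) printf "%d MB total, %.0f%% used\n", total, 100 * used / total }'
}

# Sample data copied from the listing; on AIX: lsps -a | lsps_summary
lsps_summary <<'EOF'
Page Space   Phys. Volume   Volume Group   Size    %Used   Active   Auto
hd6          hdisk0         rootvg         200MB   76      yes      yes
paging00     hdisk3         uservg         128MB   34      yes      yes
EOF
```

For the sample data this works out to 152 MB plus roughly 43.5 MB in use out of 328 MB, which rounds to 60%.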

Here is some output from an HP-UX system:

# swapinfo -tam
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev         192      34     158   18%       0       -    1  /dev/vg00/lvol2
reserve       -      98     -98
memory       65      32      33   49%
total       257     164      93   64%       -       0    -

The first three lines of the output provide details about the system swap configuration. The first line (dev) shows that 34 MB is currently in use within the paging area at /dev/vg00/lvol2 (its total size is 192 MB). The next line indicates that another 98 MB has been reserved within this paging area but is not yet in use.

The third line of the display is present when pseudo-swap has been enabled on the system. This is accomplished by setting the swapmem_on kernel variable to 1 (in fact, this is the default). Pseudo-swap allows applications to reserve more swap space than physically exists on the system, up to a limit of seven eighths of physical memory. It is important to emphasize that pseudo-swap does not itself take up any memory. Line 3 indicates that there is 33 MB of memory overcommitment capacity remaining for applications to use (32 MB is in use).

The final line (total) is a summary line. In this case, it indicates that there is 257 MB of total swap space on this system. 164 MB of it is currently either reserved or allocated: the 34 MB allocated from the paging area plus 98 MB reserved in the paging area plus 32 MB of the pseudo-swap capacity.
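The same sum can be checked mechanically. The sketch below uses a hypothetical helper, swapinfo_used, and assumes the row labels and column order of the sample output, with USED as the third field.

```shell
#!/bin/sh
# Hypothetical helper: add up the USED column (field 3) of the dev, reserve,
# and memory rows of "swapinfo -tam"-style output read from standard input.
swapinfo_used() {
    awk '$1 == "dev" || $1 == "reserve" || $1 == "memory" { sum += $3 }
         END { print sum + 0 }'
}

# Data rows copied from the sample output; on HP-UX: swapinfo -tam | swapinfo_used
swapinfo_used <<'EOF'
dev         192      34     158   18%       0       -    1  /dev/vg00/lvol2
reserve       -      98     -98
memory       65      32      33   49%
total       257     164      93   64%       -       0    -
EOF
```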

15.4.3.3 Activating paging areas

Normally, paging areas are activated automatically at boot time. On many systems, swap partitions are listed in the filesystem configuration file, usually /etc/fstab. The format of the filesystem configuration file is discussed in detail in Section 10.2, although some example entries will be given here:

/dev/ad0s2b     none    swap    sw        0 0    FreeBSD
/dev/vg01/swap  ...     swap    pri=0     0 0    HP-UX
/dev/hda1       swap    swap    defaults  0 0    Linux

The Linux entry, for example, says that the first partition on the first IDE disk (/dev/hda1) is a swap partition. The same basic form is used for all swap partitions.

Solaris systems similarly place swap areas into /etc/vfstab:

/dev/dsk/c0t0d0s1  -  -  swap  -  no  -

Tru64 systems list swap areas within the vm section of /etc/sysconfigtab:

vm:
     swapdevice = /dev/disk/dsk0b

On FreeBSD, HP-UX, Tru64, and Linux systems, all defined swap partitions are activated automatically at boot time with a command like the following:

swapon -a > /dev/console 2>&1

The swapon -a command says to activate all swap partitions. This command may also be issued manually when adding a new partition. Solaris provides the swapadd tool to perform the same function at boot time.

Under AIX, paging areas are listed in the file /etc/swapspaces:

hd6:
     dev = /dev/hd6
paging00:
     dev = /dev/paging00

Each stanza lists the name of the paging space and its associated special file (the stanza name and the filename in /dev are always the same). All paging logical volumes listed in /etc/swapspaces are activated at boot time by a swapon -a command in /etc/rc. Paging logical volumes can also be activated when they are created or by manually executing the swapon -a command.

15.4.3.4 Creating new paging areas

As we've noted, paging requires dedicated disk space, which is used to store paged-out data. Making a new swap partition on an existing disk without free space is a painful process, involving these steps:

Performing a full backup of all filesystems currently on the device and verifying that the tapes are readable.
Restructuring the physical disk organization (partition sizes and layout), if necessary.
Creating new filesystems on the disk. At this point, you are treating the old disk as if it were a brand new one.
Restoring files to the new filesystems.
Activating the new swapping area and adding it to the appropriate configuration files.

Most of these steps are covered in detail in other chapters. A better approach is the subject of the next subsection.

15.4.3.5 Filesystem paging

Many modern Unix operating systems offer a great deal more flexibility by supporting filesystem paging: paging to designated files within normal filesystems. Page files can be created or deleted as needs change, albeit at a modest increase in operating system overhead for paging.

Under Solaris, the mkfile command creates new page files. For example, the following commands create the file /chem/page_1 as a 50 MB file and then add it as a paging area:

# mkfile 50m /chem/page_1
# swap -a /chem/page_1 0 102400

The mkfile command creates a 50 MB page file with the specified pathname. The argument specifying the size of the file is interpreted as bytes unless a k (KB) or m (MB) suffix is appended to it. The regular swap command is then used to designate an existing file as a page file by substituting its pathname for the special filename.
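The final argument to swap -a in the example (102400) is the length of the area. As I understand the swap command, that length is counted in 512-byte blocks, which is how a 50 MB file comes out to 102400. The conversion is sketched below; mb_to_blocks is a hypothetical helper, not a Solaris utility.

```shell
#!/bin/sh
# Hypothetical helper: convert a size in megabytes to a count of 512-byte
# blocks, the unit assumed here for the length argument of "swap -a".
mb_to_blocks() {
    echo $(( $1 * 1024 * 1024 / 512 ))
}

mb_to_blocks 50    # length for the 50 MB page file
```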

On HP-UX systems, filesystem paging is initiated by designating a directory as the swap device to the swapon command. In this mode, it has the following basic syntax:

swapon [-m min] [-l limit] [-r reserve] dir

min is the minimum number of filesystem blocks to be used for paging (the block size is as defined when the filesystem was created: 4096 or 8192), limit is the maximum number of filesystem blocks to be used for paging space, and reserve is the amount of space reserved for files beyond that currently in use which may never be used for paging space. For example, the following command initiates paging to the /chem filesystem, limiting the size of the page file to 5000 blocks and reserving 10000 blocks for future filesystem expansion:

# swapon -l 5000 -r 10000 /chem

You can also create a new logical volume as an additional paging space under HP-UX. For example, the following commands create and activate a 125 MB swap logical volume named swap2:

# lvcreate -l 125 -n swap2 -C y -r n /dev/vg01
# swapon /dev/vg01/swap2

The logical volume uses a contiguous allocation policy and has bad block relocation disabled (-C and -r, respectively). Note that no filesystem is built on the logical volume.

On Linux systems, a page file may be created with commands like these:

# dd if=/dev/zero of=/swap1 bs=1024 count=8192  Create 8MB file.
# mkswap /swap1 8192                            Make file a swap device.
# sync; sync
# swapon /swap1                                 Activate page file.

On FreeBSD systems, a page file is created as follows:

# dd if=/dev/zero of=/swap1 bs=1024 count=8192  Create 8MB file.
# vnconfig -e vn0c /swap1 swap                  Create pseudo disk /dev/vn0c
                                                and enable swapping.

The vnconfig command configures the paging area and activates it.

Under AIX, paging space is organized as special paging logical volumes. Like normal logical volumes, paging spaces may be increased in size as desired as long as there are unallocated logical partitions in their volume group.

You can use the mkps command to create a new paging space or the chps command to enlarge an existing one. For example, the following command creates a 200 MB paging space in the volume group chemvg:

# mkps -a -n -s 50 chemvg

The paging space will be assigned a name like pagingnn where nn is a number: paging01, for example. The -a option says to activate the paging space automatically on system boots (its name is entered into /etc/swapspaces). The -n option says to activate the paging space immediately after it is created. The -s option specifies the paging space's size, in logical partitions (whose default size is 4 MB). The volume group name appears as the final item on the command line.

The size of an existing paging space may be increased with the chps command. Here the -s option specifies the number of additional logical partitions to be added:

# chps -s 10 paging01

This command adds 40 MB to the size of paging space paging01.
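Since both mkps and chps take sizes in logical partitions rather than megabytes, a small conversion is handy. The sketch below uses a hypothetical helper, mb_to_lps, which rounds up to whole partitions and assumes the 4 MB default partition size unless a second argument is given.

```shell
#!/bin/sh
# Hypothetical helper: number of logical partitions needed for a paging
# space of a given size, rounding up. The partition size defaults to the
# 4 MB mentioned above; pass a second argument if the volume group differs.
mb_to_lps() {
    mb=$1 lp_mb=${2:-4}
    echo $(( (mb + lp_mb - 1) / lp_mb ))
}

mb_to_lps 200    # value for mkps -s
mb_to_lps 40     # value for chps -s
```

For the two examples above this gives 50 and 10 partitions, matching the -s values used.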

AIX and Tru64 do not support filesystem paging, although you can use a logical volume for swapping in either environment. The latter makes it much easier to add an additional paging space without adding a new disk.

15.4.3.6 Linux and HP-UX paging space priorities

HP-UX and Linux allow you to specify a preferred usage order for multiple paging spaces via a priority system. The -p option to swapon may be used to assign a priority number to a swap partition or other paging area when it is activated. Priority numbers run from 0 to 10 under HP-UX, with lower numbered areas being used first; the default value is 1.

On Linux systems, priorities go from 0 to 32767, with higher numbered areas being used first, and they default to 0. It is usually preferable to give dedicated swap partitions a higher usage priority than filesystem paging areas.
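To see the resulting usage order on a Linux system, the paging areas can be sorted by priority, highest first. The following is a sketch: swap_use_order is a hypothetical helper, the sample data is invented for illustration, and the field positions assume the usual /proc/swaps layout (name first, priority fifth).

```shell
#!/bin/sh
# Hypothetical helper: list Linux swap areas in the order they will be used,
# highest priority first, from /proc/swaps-style input (name in field 1,
# priority in field 5; the header line is skipped).
swap_use_order() {
    awk 'NR > 1 { print $5, $1 }' | sort -k1,1nr | awk '{ print $2 }'
}

# Illustrative data, not from a real system; on Linux: swap_use_order < /proc/swaps
swap_use_order <<'EOF'
Filename        Type        Size     Used    Priority
/swap1          file        8192     0       1
/dev/hda1       partition   1049312  0       10
EOF
```

Here the dedicated partition is listed ahead of the page file because it was given the higher priority.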

15.4.3.7 Removing paging areas

Paging spaces may be removed if they are no longer needed, unless they're on the root disk. To remove a swap partition or filesystem page file in a BSD-style implementation (FreeBSD, Linux, HP-UX, and Tru64), remove the corresponding line from the appropriate system configuration file. Once the system is rebooted, the swap partition will be deactivated (rebooting is necessary to ensure that there are no active references to the partition or page file). Page files may then be removed normally with rm.

Under Solaris, the -d option to the swap command deactivates a swap area. Here are some examples:

# swap -d /dev/dsk/c1d1s1 0
# swap -d /chem/page_1 0

Once the swap -d command is executed, no new paging will be done to that area, and the kernel will attempt to free areas in it that are still in use, if possible. However, the file will not actually be removed until no processes are using it.

Under AIX, paging spaces may be removed with rmps once they are deactivated:

# chps -a n paging01
# rmps paging01

The chps command removes paging01 from the list to be activated at boot time (in /etc/swapspaces). The rmps command actually removes the paging space.

Administrative Virtues: Persistence

Monitoring system activity levels and tuning system performance both rely on the same system administrative virtue: persistence. These tasks naturally must be performed over an extended period of time, and they are also inherently cyclical (or even recursive). You'll need persistence most at two points:

When you are just getting started and don't have any idea what is wrong with the system and what to try to improve the situation.
After the euphoria from your early successes has worn off and you have to spend more time to achieve smaller improvements.

System performance tuning and system performance itself both follow the 80/20 rule: getting the last 20% done takes 80% of the time. (System administration itself often follows another variation of the rule: 20% of the people do 80% of the work.) Keep in mind the law of diminishing returns, and don't waste any time trying to eke out that last 5% or 10%.