5.12. Page Replacement

The service of page faults and other demands for memory may be satisfied from the free list for some time, but eventually memory must be reclaimed for reuse. Some pages are reclaimed when processes exit. On systems with a large amount of memory and low memory demand, exiting processes may provide enough free memory to fill demand. This case arises when there is enough memory for the kernel and for all pages that have ever been used by any current process. Obviously, many computers do not have enough main memory to retain all pages in memory. Thus, it eventually becomes necessary to move some pages back to secondary storage: to the filesystem and to the swap space. Bringing in a page is demand driven. For paging it out, however, there is no immediate indication when a page is no longer needed by a process. The kernel must implement some strategy for deciding which pages to move out of memory so that it can replace these pages with the ones that are currently needed in memory. Ideally, the strategy will choose pages for replacement that will not be needed soon. An approximation to this strategy is to find pages that have not been used actively or recently.

The 4.4BSD system implemented demand paging with a page-replacement algorithm that approximated global least-recently used [Easton & Franaszek, 1979]. In FreeBSD, the one-bit use field for each page has been augmented with an activity counter to approximate global least-actively used. Both these algorithms are examples of a global replacement algorithm: one in which the choice of a page for replacement is made according to systemwide criteria. A local replacement algorithm would choose a process for which to replace a page and then choose a page based on per-process criteria. Although the algorithm in FreeBSD is similar in nature to that in 4.4BSD, its implementation is considerably different.

The kernel scans physical memory on a regular basis, considering pages for replacement. The use of a systemwide list of pages forces all processes to compete for memory on an equal basis. Note that it is also consistent with the way that FreeBSD treats other resources provided by the system. A common alternative to allowing all processes to compete equally for memory is to partition memory into multiple independent areas, each localized to a collection of processes that compete with one another for memory. This scheme is used, for example, by the VMS operating system [Kenah & Bate, 1984]. With this scheme, system administrators can guarantee that a process, or collection of processes, will always have a minimal percentage of memory. Unfortunately, this scheme can be difficult to administer. Allocating too small a number of pages to a partition can result in underutilization of memory and excessive I/O activity to secondary-storage devices, whereas setting the number too high can result in excessive swapping [Lazowska & Kelsey, 1978].

The kernel divides the main memory into five lists:

  1. Wired: Wired pages are locked in memory and cannot be paged out. Typically, these pages are being used by the kernel or the physical-memory pager, or they have been locked down with mlock. In addition, all the pages being used to hold the user structure and thread stacks of loaded (i.e., not swapped-out) processes are also wired.

  2. Active: Active pages are being used by one or more regions of virtual memory. Although the kernel can page them out, doing so is likely to cause an active process to fault them back again.

  3. Inactive: Inactive pages have contents that are still known, but they are not usually part of any active region. If the contents of the page are dirty, the contents must be written to backing store before the page can be reused. Once the page has been cleaned, it is moved to the cache list. If the system becomes short of memory, the pageout daemon may try to move active pages to the inactive list in the hopes of finding pages that are not really in use. The selection criteria that are used by the pageout daemon to select pages to move from the active list to the inactive list are described later in this section. When the free-memory and cache lists drop too low, the pageout daemon traverses the inactive list to create more cache and free pages.

  4. Cache: Cache pages have contents that are still known, but they are not usually part of any active region. If they are mapped into an active region, they must be marked read-only so that any write attempt will cause them to be moved off the cache list. They are similar to inactive pages except that they are not dirty, either because they are unmodified since they were paged in or because they have been written to their backing store. They can be moved to the free list when needed.

  5. Free: Free pages have no useful contents and will be used to fulfill new page-fault requests. The idle process attempts to keep about 75 percent of the pages on the free list zeroed so that they do not have to be zeroed while servicing an anonymous-region page fault. Pages with unknown contents are placed at the front of the free list. Zeroed pages are placed at the end of the free list. The idle process takes pages from the front of the free list, zeros them, marks them as having been zeroed, and puts them on the end of the free list. Page faults that will be filling a page take one from the front of the free list. Page faults that need a zero-filled page take one from the end of the free list. If marked as zeroed, it does not have to be zeroed during the fault service.
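The two-ended discipline on the free list described above can be sketched as a small doubly linked list, with unknown-content pages taken from the front and pre-zeroed pages from the back. This is an illustrative model only; the structure and function names (`free_page`, `freelist`, `idle_zero_one`) are hypothetical, not FreeBSD's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_WORDS 4            /* tiny "page" for illustration */

struct free_page {
    struct free_page *next, *prev;
    bool zeroed;
    int  words[PAGE_WORDS];
};

struct freelist {
    struct free_page *head, *tail;
};

/* Pages with unknown contents go on the front of the list. */
static void
freelist_push_front(struct freelist *fl, struct free_page *pg)
{
    pg->prev = NULL;
    pg->next = fl->head;
    if (fl->head != NULL)
        fl->head->prev = pg;
    else
        fl->tail = pg;
    fl->head = pg;
}

/* Zeroed pages go on the back of the list. */
static void
freelist_push_back(struct freelist *fl, struct free_page *pg)
{
    pg->next = NULL;
    pg->prev = fl->tail;
    if (fl->tail != NULL)
        fl->tail->next = pg;
    else
        fl->head = pg;
    fl->tail = pg;
}

/*
 * Idle-loop step: take a page from the front, zero it, mark it as
 * zeroed, and put it on the back where zero-fill faults will find it.
 */
static void
idle_zero_one(struct freelist *fl)
{
    struct free_page *pg = fl->head;

    if (pg == NULL || pg->zeroed)
        return;                 /* nothing left to zero */
    fl->head = pg->next;
    if (fl->head != NULL)
        fl->head->prev = NULL;
    else
        fl->tail = NULL;
    memset(pg->words, 0, sizeof(pg->words));
    pg->zeroed = true;
    freelist_push_back(fl, pg);
}

/*
 * Allocation: zero-fill faults take from the back; faults that will
 * overwrite the whole page take from the front.
 */
static struct free_page *
alloc_page(struct freelist *fl, bool want_zeroed)
{
    struct free_page *pg = want_zeroed ? fl->tail : fl->head;

    if (pg == NULL)
        return NULL;
    if (pg->prev != NULL) pg->prev->next = pg->next; else fl->head = pg->next;
    if (pg->next != NULL) pg->next->prev = pg->prev; else fl->tail = pg->prev;
    return pg;
}
```

The page taken from the back may still need zeroing if the idle process has not reached it yet, which is why the `zeroed` mark is checked at fault time rather than assumed.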

The pages of main memory that can be used by user processes are those on the active, inactive, cache, and free lists. Requests for new pages are taken first from the free list if it has pages available; otherwise, they are taken from the cache list. The cache and free lists are subdivided into separate lists organized by page color.
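Page coloring can be sketched as follows: with a physically indexed cache spanning `ncolors` pages, physical pages whose page-frame numbers are equal modulo `ncolors` compete for the same cache lines, so the kernel buckets free and cache pages by that residue. The function names and the idea of matching the virtual page number are illustrative assumptions, not the exact FreeBSD code.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12           /* 4-Kbyte pages assumed */

/* Color of a physical page: its page-frame number modulo the number
 * of colors (cache size in pages). */
static unsigned
page_color(uint64_t paddr, unsigned ncolors)
{
    return (unsigned)((paddr >> PAGE_SHIFT) % ncolors);
}

/*
 * Preferred bucket for a fault at virtual address va: picking the
 * color that matches the virtual page number keeps consecutive
 * virtual pages in distinct cache colors, spreading a process's
 * working set evenly across the cache.
 */
static unsigned
preferred_color(uint64_t va, unsigned ncolors)
{
    return (unsigned)((va >> PAGE_SHIFT) % ncolors);
}
```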

Ideally, the kernel would maintain a working set for each process in the system. It would then know how much memory to provide to each process to minimize its page-fault behavior. The FreeBSD virtual-memory system does not use the working-set model because it lacks accurate information about the reference pattern of a process. It does track the number of pages held by a process via the resident-set size, but it does not know which of the resident pages constitute the working set. In 4.3BSD, the count of resident pages was used in making decisions on whether there was enough memory for a process to be swapped in when that process wanted to run. This feature was not carried over to the FreeBSD virtual-memory system. Because it worked well during periods of high memory demand, this feature should be incorporated in future FreeBSD systems.

Paging Parameters

The memory-allocation needs of processes compete constantly, through the page-fault handler, with the overall system goal of maintaining a minimum threshold of pages in the inactive, cache, and free lists. As the system operates, it monitors main-memory utilization and attempts to run the pageout daemon frequently enough to keep the amount of inactive, cache, and free memory at or above the minimum threshold shown in Table 5.2. When the page-allocation routine, vm_page_alloc(), determines that more memory is needed, it awakens the pageout daemon.

Table 5.2. Available-memory thresholds.

    Pool        Minimum   Target
    Free          0.7%      3%
    Cache         3%        6%
    Inactive      0%        4.5%


The number of pages to be reclaimed by the pageout daemon is a function of the memory needs of the system. As more memory is needed by the system, more pages are scanned. This scanning causes the number of pages freed to increase. The pageout daemon determines the memory needed by comparing the number of available-memory pages against several parameters that are calculated during system startup. The desired values for the paging parameters are communicated to the pageout daemon through global variables that may be viewed or changed with sysctl. Likewise, the pageout daemon records its progress in global counters that may be viewed or reset with sysctl. Progress is measured by the number of pages scanned over each interval that it runs.
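The startup computation and the allocator's wakeup check can be sketched as below. The percentages come from Table 5.2; the structure and variable names (`pageout_params`, `should_wake_pageout`) are illustrative, not the actual kernel globals exported through sysctl.

```c
#include <assert.h>

struct pageout_params {
    long free_min, free_target;
    long cache_min, cache_target;
    long inactive_target;
};

/* Turn the Table 5.2 percentages into page counts at system startup. */
static struct pageout_params
compute_params(long managed_pages)
{
    struct pageout_params p;

    p.free_min        = managed_pages *  7 / 1000;  /* 0.7% */
    p.free_target     = managed_pages *  3 / 100;   /* 3%   */
    p.cache_min       = managed_pages *  3 / 100;   /* 3%   */
    p.cache_target    = managed_pages *  6 / 100;   /* 6%   */
    p.inactive_target = managed_pages * 45 / 1000;  /* 4.5% */
    return p;
}

/*
 * vm_page_alloc()-style check: wake the pageout daemon when the
 * combined free and cache memory drops below the combined minimums.
 */
static int
should_wake_pageout(const struct pageout_params *p,
    long free_count, long cache_count)
{
    return free_count + cache_count < p->free_min + p->cache_min;
}
```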

The goal of the pageout daemon is to maintain the inactive, cache, and free queues between the minimum and target thresholds shown in Table 5.2. The pageout daemon achieves this goal by moving pages from more active queues to less active queues to reach the indicated ranges. It moves pages from the cache list to the free list to maintain the free list minimum. It moves pages from the inactive list to the cache list to keep the cache list near its target. It moves pages from the active list to the inactive list to maintain the inactive list near its target.

The Pageout Daemon

Page replacement is done by the pageout daemon (process 2). The paging policy of the pageout daemon is embodied in the vm_pageout() and vm_pageout_scan() routines. When the pageout daemon reclaims pages that have been modified, it is responsible for writing them to the swap area. Thus, the pageout daemon must be able to use normal kernel-synchronization mechanisms, such as sleep(). It therefore runs as a separate process, with its own process structure and kernel stack. Like init, the pageout daemon is created by an internal fork operation during system startup (see Section 14.4); unlike init, however, it remains in kernel mode after the fork. The pageout daemon simply enters vm_pageout() and never returns. Unlike some other users of the disk I/O routines, the pageout process needs to do its disk operations asynchronously so that it can continue scanning in parallel with disk writes.

Historically, the pages were handled using a least-recently used algorithm. The drawback to this algorithm is that a sudden burst of memory activity can flush many useful pages from the cache. To mitigate this behavior, FreeBSD uses a least-actively used algorithm to preserve pages that have a history of usage so that they will be favored over the once-used pages brought in during a period of high memory demand.

When a page is first brought into memory it is given an initial usage count of three. Further usage information is gathered by the pageout daemon during its periodic scans of memory. As each page of memory is scanned, its reference bit is checked. If the bit is set, it is cleared and the usage counter for the page is incremented (up to a limit of 64) by the number of references to the page. If the reference bit is clear, the usage counter is decremented. When the usage counter reaches zero, the page is moved from the active list to the inactive list. Pages that are repeatedly used build up large usage counts that will cause them to remain on the active list much longer than pages that are used just once.
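The bookkeeping just described can be sketched as a per-page counter update done on each scan. The constants match the text (initial count of three, limit of 64); the structure and field names are illustrative stand-ins for the state the kernel keeps in its page structure.

```c
#include <assert.h>

#define ACT_INIT 3      /* usage count given at first page-in */
#define ACT_MAX  64     /* counter saturates here */

struct page_stats {
    int act_count;      /* activity counter */
    int referenced;     /* hardware reference bit, sampled per scan */
    int on_active_list; /* 1 while on the active list */
};

/* A newly paged-in page starts with a usage count of three. */
static void
page_init(struct page_stats *pg)
{
    pg->act_count = ACT_INIT;
    pg->referenced = 0;
    pg->on_active_list = 1;
}

/* One pageout-daemon scan step for a page on the active list. */
static void
scan_active_page(struct page_stats *pg)
{
    if (pg->referenced) {
        pg->referenced = 0;         /* clear the reference bit */
        if (pg->act_count < ACT_MAX)
            pg->act_count++;        /* page is in use: credit it */
    } else if (--pg->act_count <= 0) {
        pg->act_count = 0;
        pg->on_active_list = 0;     /* demote to the inactive list */
    }
}
```

A once-used page thus survives only a few scans, while a page referenced on most scans accumulates a large count and stays on the active list through a burst of demand.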

The goal of the pageout daemon is to keep the number of pages on the inactive, cache, and free lists within their desired ranges. Whenever an operation that uses pages causes the amount of free memory to fall below the minimum thresholds, the pageout daemon is awakened. The pageout-handling algorithm is shown in Figure 5.14 (on pages 192-193). In the following overview, the lettered points are references to the tags down the left side of the code.

  1. The pageout daemon calculates the number of pages that need to be moved from the inactive list to the cache list. As some will later be moved to the free list, enough pages must be moved to the cache list to leave it at its minimum level after the free list has been filled. To avoid saturating the I/O system, the pageout daemon limits the number of I/O operations that it will start concurrently.

  2. Scan the inactive list until the desired number of pages are moved. Skip over busy pages, since they are likely being paged out and can be moved later when they are clean.

  3. If we find a page that is part of an object that has started being used, then it has been moved to the inactive list prematurely. So update its usage count and move it back to the active list. Pages with no associated object have no useful contents and are moved to the free list. Pages that are clean can be moved to the cache list.

  4. Dirty pages need to be paged out, but flushing a page is extremely expensive compared to freeing a clean page. Thus, dirty pages are given extra time on the inactive queue by cycling them through the queue twice before being flushed. They cycle through the list once more while being cleaned. This extra time on the inactive queue will reduce unnecessary I/O caused by prematurely paging out an active page. The clustering checks for up to eight dirty pages on either side of the selected page. The pager is only required to write the selected page. However, it may write as many of the clustered dirty pages as it finds convenient. The scanning of the inactive list stops if the number of pageouts in progress hits its limit. In 4.4BSD, the I/O completions were handled by the pageout daemon. FreeBSD requires that pagers track their own I/O operations including the appropriate updating of the written pages. The written-page update at I/O completion does not move the page from the inactive list to the cache list. Rather, the page remains on the inactive list until it is eventually moved to the cache list during a future pass of the pageout daemon.

    Figure 5.14. Pageout handling.
    /*
     * Vm_pageout_scan does the dirty work for the pageout daemon.
     */
    void
    vm_pageout_scan(void)
    {
    [A] page_shortage = free_target + cache_minimum -
            (free_count + cache_count);
        max_writes_in_progress = 32;
    [B] for (page = FIRST(inactive list); page; page = next) {
            next = NEXT(page);
            if (page_shortage < 0)
                break;
            if (page busy)
                continue;
    [C]     if (page's object has ref_count > 0 &&
                page is referenced) {
                update page active count;
                move page to end of active list;
                continue;
            }
            if (page is invalid) {
                move page to front of free list;
                page_shortage--;
                continue;
            }
    [D]     if (first time page seen dirty) {
                mark page as seen dirty;
                move to end of inactive list;
                continue;
            }
            check for cluster of dirty pages around page;
            start asynchronous write of page cluster;
            move to end of inactive list;
            page_shortage--;
            max_writes_in_progress--;
            if (max_writes_in_progress == 0)
                break;
        }
    [E] page_shortage = free_target + cache_minimum + inactive_target -
            (free_count + cache_count + inactive_count);
    [F] for (page = FIRST(active list); page; page = next) {
            next = NEXT(page);
            if (page_shortage < 0)
                break;
            if (page busy)
                continue;
            if (page's object has ref_count > 0) {
                update page active count;
                move page to end of active list;
                continue;
            }
    [G]     decrement page active count;
            if (page active count > 0) {
                move page to end of active list;
                continue;
            }
            page_shortage--;
            if (page is clean)
                move page to end of cache list;
            else
                move page to end of inactive list;
        }
    [H] page_shortage = free_minimum - free_count;
        for (page = FIRST(cache list); page; page = next) {
            next = NEXT(page);
            if (page_shortage < 0)
                break;
            if (page busy)
                continue;
            move page to front of free list;
        }
    [I] if (targets not met)
            request swap-out daemon to run;
    [J] if (nearly all memory and swap in use)
            kill biggest process;
    }

  5. The pageout daemon calculates the number of pages that need to be moved from the active list to the inactive list. As some will eventually need to be moved to the cache and free lists, enough pages must be moved to the inactive list to leave it at its target level after the cache and free lists have been filled.

  6. Scan the active list until the desired number of pages are moved. Skip over busy pages as they are likely being actively used. If we find a page that is part of an active object, update its usage count and move it to the end of the active list.

  7. The page is not active, so decrement its usage count. If its usage is still above zero, move it to the end of the active list. Otherwise, move it to the inactive list if it is dirty or the cache list if it is clean.

  8. Calculate the number of pages that need to be moved to the free list. It only fills the free list to its minimum level. It prefers to keep pages on the cache list since they can be reclaimed if their object or file is reactivated. Once on the free list the identity of a page is lost. Thus, the pageout daemon delays moving pages to the free list until just before they are expected to be needed.

  9. If the page-count targets have not been met, the swap-out daemon is started (see next subsection) to try to clear additional memory.

  10. The FreeBSD 5.2 kernel does not impose any limits on the amount of virtual memory that it will grant. If it finds that it has nearly filled its memory and swap space, it avoids going into deadlock by killing off the largest process. One day virtual-memory accounting and limits will be added to the system so as to avoid this crude resource-control mechanism.
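The last-resort policy in step [J] of Figure 5.14, choosing the largest process as the victim, can be sketched as a simple scan over resident-set sizes. The structure is hypothetical; the kernel walks its own process list rather than an array.

```c
#include <assert.h>
#include <stddef.h>

struct proc_info {
    int  pid;
    long resident_pages;    /* resident-set size in pages */
};

/*
 * Pick the process with the largest resident size to kill when
 * memory and swap are nearly exhausted. Returns -1 if there are
 * no candidates.
 */
static int
pick_victim(const struct proc_info *procs, size_t n)
{
    size_t i, victim = 0;

    if (n == 0)
        return -1;
    for (i = 1; i < n; i++)
        if (procs[i].resident_pages > procs[victim].resident_pages)
            victim = i;
    return procs[victim].pid;
}
```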

Note that the page and object locking has been elided in Figure 5.14 to simplify the explanation.

Swapping

Although swapping is generally avoided, there are several times when it is used in FreeBSD to address a serious memory shortage. Swapping is done in FreeBSD when any of the following occurs:

  • The system becomes so short of memory that the paging process cannot free memory fast enough to satisfy the demand. For example, a memory shortfall may happen when multiple large processes are run on a machine lacking enough memory for the minimum working sets of the processes.

  • Processes are completely inactive for more than 10 seconds. Otherwise, such processes would retain a few pages of memory associated with their user structure and thread stacks.

Swap operations completely remove a process from main memory, including the process page tables, the pages of the data and the stack segments that are not already in swap space, and the user structure and thread stacks.

Process swapping is invoked only when paging is unable to keep up with memory needs or when short-term memory needs warrant swapping a process. In general, the swap-scheduling mechanism does not do well under heavy load; system performance is much better when memory scheduling can be done by the page-replacement algorithm than when the swap algorithm is used.

Swap out is driven by the swap-out daemon, vmdaemon (process 3). The swap-out policy of the vmdaemon is embodied in the vm_daemon() routine. If the pageout daemon can find any processes that have been sleeping for more than 10 seconds (swap_idle_threshold2, the cutoff for considering the time sleeping to be "a long time"), it will swap out the one sleeping for the longest time. Such processes have the least likelihood of making good use of the memory that they occupy; thus, they are swapped out even if they are small. If none of these processes are available, the pageout daemon will swap out a process that has been sleeping for as briefly as 2 seconds (swap_idle_threshold1). These criteria attempt to avoid swapping entirely until the pageout daemon is clearly unable to keep enough memory free.
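The two-tier idle test just described can be sketched as two passes over the sleeping processes: first anyone asleep beyond swap_idle_threshold2, then anyone asleep at least swap_idle_threshold1, longest sleeper first in each pass. The structure and function names are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define SWAP_IDLE_THRESHOLD1 2      /* shortest idle time considered */
#define SWAP_IDLE_THRESHOLD2 10     /* "a long time" */

struct sleeper {
    int pid;
    int sleep_secs;     /* how long the process has been asleep */
};

/* Pick a swap-out victim; returns -1 if no process qualifies. */
static int
pick_swap_victim(const struct sleeper *p, size_t n)
{
    size_t i;
    int victim = -1, longest = -1;

    /* First pass: prefer anyone asleep "a long time" (> 10 s). */
    for (i = 0; i < n; i++)
        if (p[i].sleep_secs > SWAP_IDLE_THRESHOLD2 &&
            p[i].sleep_secs > longest) {
            longest = p[i].sleep_secs;
            victim = p[i].pid;
        }
    if (victim != -1)
        return victim;
    /* Fallback: anyone asleep for at least 2 seconds. */
    for (i = 0; i < n; i++)
        if (p[i].sleep_secs >= SWAP_IDLE_THRESHOLD1 &&
            p[i].sleep_secs > longest) {
            longest = p[i].sleep_secs;
            victim = p[i].pid;
        }
    return victim;
}
```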

In 4.4BSD, if memory was still desperately low, the swap-out daemon would select to swap out the runnable process that had been resident the longest. Once swapping of runnable processes had begun, the processes eligible for swapping would take turns in memory so that no process was frozen out entirely. The FreeBSD swap-out daemon will not select a runnable process to swap out. So, if the set of runnable processes does not fit in memory, the machine will effectively deadlock. Current machines have enough memory that this condition does not arise. If it does, the 4.4BSD algorithm will need to be restored.

The mechanics of doing a swap out are simple. The swapped-in process flag P_INMEM is cleared to show that the process is not resident in memory. The PS_SWAPPINGOUT flag is set while the swap out is being done so that neither a second swap out nor a swap in is attempted at the same time. If a runnable process is to be swapped (which currently never happens), it needs to be removed from the runnable process queue. The user structure of the process and the kernel stack of its threads are then marked as pageable, which allows the user structure and stack pages, along with any other remaining pages for the process, to be paged out via the standard pageout mechanism. The swapped-out process cannot be run until after it is swapped back into memory.

The Swap-In Process

Swap-in operations are done by the swapping process, swapper (process 0). This process is the first one created by the system when the latter is started. The swap-in policy of the swapper is embodied in the scheduler() routine. This routine swaps processes back in when memory is available and they are ready to run. At any time, the swapper is in one of three states:

  1. Idle: No swapped-out processes are ready to be run. Idle is the normal state.

  2. Swapping in: At least one runnable process is swapped out, and scheduler() attempts to find memory for it.

  3. Swapping out: The system is short of memory, or there is not enough memory to swap in a process. Under these circumstances, scheduler() awakens the pageout daemon to free pages and to swap out other processes until the memory shortage abates.

If more than one swapped-out process is runnable, the first task of the swapper is to decide which process to swap in. This decision may affect the decision whether to swap out another process. Each swapped-out process is assigned a priority based on

  • The length of time it has been swapped out

  • Its nice value

  • The amount of time it was asleep since it last ran

In general, the process that has been swapped out longest or was swapped out because it had slept for a long time before being swapped will be brought in first. Once a process is selected, the swapper checks to see whether there is enough memory free to swap in the process. Historically, the system required as much memory to be available as was occupied by the process before that process was swapped. Under FreeBSD, this requirement was reduced to a requirement that the number of pages on the free and cache lists be at least equal to the minimum free-memory threshold. If there is enough memory available, the process is brought back into memory. The user structure is swapped in immediately, but the process loads the rest of its working set by demand paging from the backing store. Thus, not all the memory that is needed by the process is used immediately. Earlier BSD systems tracked the anticipated demand and would only swap in runnable processes as free memory became available to fulfill their expected needs. FreeBSD allows all swapped-out runnable processes to be swapped in as soon as there is enough memory to load their user structure and thread stacks.
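A priority in the spirit of the three factors listed above might be sketched as follows. The weighting here is invented purely for illustration; the kernel's actual formula differs.

```c
#include <assert.h>

/*
 * Hypothetical swap-in priority: longer time swapped out and longer
 * prior sleep raise priority; a higher (less favored) nice value
 * lowers it. Larger result means swap in sooner.
 */
static int
swapin_priority(int secs_swapped_out, int nice, int secs_slept)
{
    return secs_swapped_out + secs_slept / 2 - nice;
}
```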

The procedure for swap in of a process is the reverse of that for swap out:

1. Memory is allocated for the user structure and the kernel stack of each of its threads, and they are read back from swap space.

2. The process is marked as resident, and its runnable threads are returned to the run queue (i.e., those threads that are not stopped or sleeping).

After the swap in completes, the process is ready to run like any other, except that it has no resident pages. It will bring in the pages that it needs by faulting them.

The Design and Implementation of the FreeBSD Operating System
ISBN: 0201702452
Year: 2003