10.3. The Page Scanner

The page scanner is the memory management daemon that manages systemwide physical memory. The page scanner and the virtual memory page fault mechanism are the core of the demand-paged memory allocation system used to manage Solaris memory. When there is a memory shortage, the page scanner runs to steal memory from address spaces by taking pages that haven't been used recently, syncing them up with their backing store (swap space if they are anonymous pages), and freeing them. If paged-out virtual memory is required again by an address space, then a memory page fault occurs when the virtual address is referenced and the pages are recreated and copied back from their backing store.

The balancing of page stealing and page faults determines which parts of virtual memory will be backed by real physical memory and which will be moved out to swap. The page scanner does not understand the memory usage patterns or working sets of processes; it only knows reference information on a physical page-by-page basis. This policy is often referred to as global page replacement; the alternative process-based page management is known as local page replacement.

The subtleties of which pages are stolen govern the memory allocation policies and can affect different workloads in different ways. During the life of the Solaris kernel, only two significant changes in memory replacement policies have occurred:

Enhancements to minimize page stealing from extensively shared libraries and executables
Enhancements to allow auto-tuning of the fastscan and handspread parameters

We discuss these changes in more detail when we describe page scanner implementation.

10.3.1. Page Scanner Operation

The page scanner tracks page usage by reading a per-page hardware bit from the hardware MMU for each page. Two bits are kept for each page; they indicate whether the page has been modified or referenced since the bits were last cleared. The page scanner uses the bits as the fundamental data to decide which pages of memory have been used recently and which have not.

The page scanner is a kernel thread, which is awakened when the amount of memory on the free-page list falls below a system threshold, typically 1/64th of total physical memory. The page scanner scans through pages in physical page order, looking for pages that haven't been used recently to page out to the swap device and free. The algorithm that determines whether pages have been used resembles a clock face and is known as the two-handed clock algorithm. This algorithm views the entire physical page list as a circular list, where the last physical page wraps around to the first. Two hands sweep through the physical page list, as shown in Figure 10.6.

Figure 10.6. Two-Handed Clock Algorithm

The two hands, the front hand and back hand, rotate clockwise in page order around the list. The front hand rotates ahead of the back hand, clearing the referenced and modified bits for each page. The trailing back hand then inspects the referenced and modified bits some time later. Pages that have not been referenced or modified are swapped out and freed. The rate at which the hands rotate around the page list is controlled by the amount of free memory on the system, and the gap between the front hand and back hand is fixed by a dynamically calculated value, handspreadpages.
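The sweep described above can be sketched as a small simulation. This is an illustrative model only, not the kernel's code: it tracks a single "referenced" flag per page, and the function name, the `accesses` argument, and the step-by-step timing are assumptions made for the example.

```python
def two_handed_sweep(num_pages, handspread, accesses):
    """Simulate a little over one revolution of the two-handed clock.

    accesses maps a step number to the pages the workload touches at that
    step (a hypothetical stand-in for the MMU setting reference bits).
    Returns the pages the back hand finds unreferenced, in scan order.
    """
    referenced = [True] * num_pages          # assume every page starts referenced
    stolen = []
    for step in range(num_pages + handspread):
        front = step % num_pages             # the list is circular
        referenced[front] = False            # front hand clears the flag
        for page in accesses.get(step, ()):  # workload touches pages
            referenced[page] = True
        if step >= handspread:               # back hand trails by handspread pages
            back = (step - handspread) % num_pages
            if not referenced[back]:
                stolen.append(back)          # untouched since the front hand passed
    return stolen


# Pages 0 and 3 are touched after the front hand passes, so they survive.
print(two_handed_sweep(8, 3, {2: [0], 4: [3]}))
```

In this model, a larger `handspread` gives each page more steps in which to be touched before the back hand arrives, so fewer pages are stolen.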

10.3.2. Page-Out Algorithm and Parameters

The page-out algorithm is controlled by several parameters, some of which are calculated at system startup from the amount of memory in the system, and some of which are calculated dynamically based on memory allocation and paging activity.

The parameters that control the clock hands do two things: They control the rate at which the scanner scans through pages, and they control the time (or distance) between the front hand and the back hand.

Starting with Solaris 9, a new maximum clamp is calculated for the page scanner, based on the number of pages the scanner could scan in one second at its maximum rate. This number is calculated from a simple experiment when the scanner first starts, and is stored in the dynamic variable pageout_new_spread. We can check the calculated value with mdb:

# mdb -k
> pageout_new_spread/E
pageout_new_spread:
pageout_new_spread:       127678

The distance between the back hand and the front hand is handspreadpages and is expressed in units of pages. The maximum distance between the front hand and back hand is calculated based on the scanner's performance (pageout_new_spread).

10.3.2.1. Scan Rate Parameters

The scanner starts scanning when free memory falls below lotsfree pages plus a small buffer factor, deficit. At this point the scanner scans at a rate of slowscan pages per second, and it gets faster as the amount of free memory approaches zero. The system parameter lotsfree is calculated at startup as 1/64th of memory. The parameter deficit is either zero or a small number of pages set by the page allocator at times of large memory allocation, to let the scanner free a few more pages above lotsfree in anticipation of more memory requests.

Figure 10.7 shows that the rate at which the scanner scans increases linearly as free memory falls from lotsfree toward zero. The scanner starts scanning at the minimum rate set by slowscan when memory falls below lotsfree and then increases to the calculated maximum if memory falls low enough. The maximum scan rate is set to the value of pageout_new_spread, the maximum number of pages the scanner can scan per second, and is stored in the global variable fastscan.

Figure 10.7. Page Scanner Rate, Interpolated by Number of Free Pages

The number of pages scanned increases from the slowest rate (set by slowscan when lotsfree pages are free) to a maximum determined by the dynamic variable fastscan. Free memory never actually reaches zero, but for simplicity the algorithm calculates the maximum interpolated rate against the free memory ranging between lotsfree and zero. In our example system with 1 Gbyte of physical memory (shown in Figure 10.7), we can see that the scanner starts scanning when free memory falls to 16 Mbytes plus the short-term memory deficit.

For this example, we'll assume that the deficit is zero. When free memory falls to 16 Mbytes, the scanner will wake up and start examining 100 pages per second, according to the system parameter slowscan. The slowscan parameter is 100 by default on Solaris systems, and fastscan is dynamically calculated. If free memory falls to 12 Mbytes (1,536 8K-byte pages), the scanner scans at a higher rate, according to the page scanner interpolation shown in the following equation:

$$\text{scan rate} = \frac{\text{lotsfree} - \text{free memory}}{\text{lotsfree}} \times \text{fastscan} + \frac{\text{free memory}}{\text{lotsfree}} \times \text{slowscan}$$

We can read the calculated value for fastscan via mdb:

# mdb -k
> fastscan/E
fastscan:
fastscan:       127678

If we convert free memory and lotsfree to numbers of pages (free memory of 12 Mbytes is 1,536 pages, and lotsfree is set to 16 Mbytes, or 2,048 pages), then we scan at 31,994 pages per second.
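The linear interpolation between slowscan and fastscan can be checked with a few lines of Python. This is a sketch of the arithmetic described in this section, not the kernel routine; the function name and the integer rounding are assumptions.

```python
def scan_rate(freemem, lotsfree, slowscan, fastscan):
    """Pages per second to scan, interpolated linearly by free memory.

    All memory quantities are in pages. The rate is slowscan when
    freemem == lotsfree and fastscan when freemem == 0.
    """
    freemem = max(0, min(freemem, lotsfree))   # clamp to the interpolation range
    return (fastscan * (lotsfree - freemem) + slowscan * freemem) // lotsfree


# Example system from the text: lotsfree = 2,048 pages, 1,536 pages free.
print(scan_rate(1536, 2048, slowscan=100, fastscan=127678))   # 31994
```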

By default, the scanner is run four times per second when there is a memory shortage. If the amount of free memory falls below the system parameter minfree, the scanner is awoken by the page allocator for each page-create request. This scheme helps the scanner try to keep at least minfree pages on the free list.

10.3.2.2. Not-Recently-Used Time

The time between the front hand and back hand varies according to the number of pages between the front hand and back hand and the rate at which the scanner is scanning. The time between the front hand clearing the reference bit and the back hand checking the reference bit is a significant factor that affects the behavior of the scanner because it controls the amount of time that a page can be left alone before it is potentially stolen by the page scanner. A short time between the reference bit being cleared and checked means that only the most active pages remain intact; a long time means that only the largely unused pages are stolen. The ideal behavior is the latter because we want only the least recently used pages stolen, which means we want a long time between the front and back hands.

The time between clearing and checking of the reference bit can vary from just a few seconds to several hours, depending on the scan rate.
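As a rough worked example, the time a page has between the two hands is the hand spread divided by the scan rate. The sketch below assumes, purely for illustration, that handspreadpages equals the pageout_new_spread value shown earlier (127,678 pages); actual values vary with memory size and system.

```python
def hand_gap_seconds(handspreadpages, pages_per_second):
    """Seconds between the front hand clearing a page's reference bit
    and the back hand checking it, at a constant scan rate."""
    return handspreadpages / pages_per_second


spread = 127678                                 # assumed hand spread, in pages
print(hand_gap_seconds(spread, 100))            # at slowscan: about 1,277 seconds
print(hand_gap_seconds(spread, 127678))         # at fastscan: 1 second
```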

10.3.3. Shared Library Optimizations

A subtle optimization added to the page scanner prevents it from stealing pages from extensively shared libraries. The page scanner looks at the share reference count for each page; if the page is shared more than a certain number of times, it is skipped during the page scan operation. An internal parameter, po_share, sets the threshold for the number of shares a page can have before it is skipped. If the page has more than po_share mappings (i.e., it's shared by more than po_share processes), then it is skipped. By default, po_share starts at 8; on each revolution of the clock it is decremented, unless the scan around the clock does not find any page to free, in which case po_share is incremented. The po_share parameter can float between 8 and 134217728.
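The skip test and the floating threshold might be modeled as follows. The text does not state the step size of each adjustment; this sketch assumes doubling and halving, which fits the power-of-two bounds (134217728 is 8 shifted left by 24 bits), and the function names are hypothetical.

```python
PO_SHARE_MIN = 8
PO_SHARE_MAX = 134217728        # 8 << 24


def should_skip(share_count, po_share):
    """True if a page is shared too widely for the scanner to steal it."""
    return share_count > po_share


def adjust_po_share(po_share, freed_any):
    """Threshold to use on the next revolution of the clock.

    Assumption: the threshold halves after a productive revolution and
    doubles after one that freed nothing, within the documented bounds.
    """
    if freed_any:
        return max(PO_SHARE_MIN, po_share // 2)
    return min(PO_SHARE_MAX, po_share * 2)
```

Raising the threshold after an unproductive revolution lets the scanner consider progressively more widely shared pages only when nothing else can be freed.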

10.3.3.1. Page Scanner CPU Utilization Clamp

A CPU utilization clamp on the scan rate prevents the page-out daemon from using too much processor time. Two parameters, min_percent_cpu and max_percent_cpu, govern the desired and maximum CPU time that the scanner should use. Like the scan rate, the actual amount of CPU that can be used at any given time is interpolated by the amount of free memory. It ranges from min_percent_cpu when free memory is at lotsfree (cachefree with priority paging enabled) to max_percent_cpu if free memory were to fall to zero. The defaults for min_percent_cpu and max_percent_cpu are 4% and 80% of a single CPU, respectively (the scanner is single threaded).
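The CPU clamp can be sketched the same way as the scan rate. The example assumes the interpolation is linear, as it is for the scan rate; the function name is hypothetical and the defaults are the 4% and 80% figures given above.

```python
def scanner_cpu_percent(freemem, lotsfree, min_percent_cpu=4, max_percent_cpu=80):
    """Percent of one CPU the scanner may use, interpolated by free memory.

    Assumes linear interpolation: min_percent_cpu at freemem == lotsfree,
    max_percent_cpu at freemem == 0. Memory quantities are in pages.
    """
    freemem = max(0, min(freemem, lotsfree))
    span = max_percent_cpu - min_percent_cpu
    return min_percent_cpu + span * (lotsfree - freemem) / lotsfree


# With 1,536 of 2,048 lotsfree pages still free, the clamp is 23% of a CPU.
print(scanner_cpu_percent(1536, 2048))
```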

10.3.4. Parameters That Limit Pages Paged Out

Another parameter, maxpgio, limits the rate at which I/O is queued to the swap devices. It is set low to prevent saturation of the swap devices. The parameter defaults to 40 I/Os per second on x86 architectures and to 60 I/Os per second on SPARC.

Because the page-out daemon also pages out dirty file system pages that it finds during scanning, this parameter can also indirectly limit file system throughput. File system I/O requests are normally queued and written by user processes and hence are not subject to maxpgio. However, when a lot of file system write activity is going on and many dirty file system pages are in memory, the page-out scanner trips over these and queues these I/Os; as a result, the maxpgio limit can sometimes affect file system write throughput.

10.3.4.1. Summary of Page Scanner Parameters

Table 10.3 describes the parameters that control the page-out process in the current Solaris and patch releases.

Table 10.3. Page Scanner Parameters
Parameter	Description	Min	Default
`lotsfree`	The scanner starts stealing anonymous memory pages when free memory falls below `lotsfree`.	512K	1/64th of memory
`desfree`	If free memory falls below `desfree`, then the page-out scanner is started 100 times/second.	minfree	`lotsfree/2`
`minfree`	If free memory falls below `minfree`, then the page scanner is signaled to start every time a new page is created.		`desfree/2`
`throttlefree`	The number at which point the `page_create` routines make the caller wait until free pages are available.		`minfree`
`slowscan`	The rate of pages scanned per second when free memory = `lotsfree`.		100
`maxpgio`	A throttle for the maximum number of pages per second that the swap device can handle.	~60	60 or 90 pages/s
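The default relationships in Table 10.3 can be worked through for a given memory size. This sketch applies only the Default column; it ignores the Min column and any manual tuning, and the function name is hypothetical.

```python
def default_thresholds(physmem_pages):
    """Default paging thresholds, in pages, per the Default column of Table 10.3."""
    lotsfree = physmem_pages // 64      # 1/64th of memory
    desfree = lotsfree // 2
    minfree = desfree // 2
    return {
        "lotsfree": lotsfree,
        "desfree": desfree,
        "minfree": minfree,
        "throttlefree": minfree,        # defaults to minfree
    }


# 1 Gbyte of memory in 8-Kbyte pages is 131,072 pages.
print(default_thresholds(131072))
```

For the 1-Gbyte example system this gives lotsfree of 2,048 pages (16 Mbytes), matching the figure used in Section 10.3.2.1.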

10.3.5. Page Scanner Implementation

The page scanner is implemented as two kernel threads, both of which use process number 2, "pageout." One thread scans pages, and the other thread pushes the dirty pages queued for I/O to the swap device. In addition, the kernel callout mechanism wakes the page scanner thread when memory is insufficient. (The kernel callout scheduling mechanism is discussed in detail in Section 19.2.)

The scanner schedpaging() function is called four times per second by a callout placed in the callout table. The schedpaging() function checks whether free memory is below the threshold (lotsfree or cachefree) and, if required, triggers the scanner thread. The page scanner is awakened not only by the callout thread; it is also triggered by the page allocator if memory falls below throttlefree. Figure 10.8 illustrates how the page scanner works.

Figure 10.8. Page Scanner Architecture

When called, the schedpaging routine calculates two setup parameters for the page scanner thread: the number of pages to scan and the number of CPU ticks that the scanner thread can consume while doing so. The number of pages and CPU ticks are calculated according to the equations shown in Section 10.3.2.1 and Section 10.3.3.1. Once the scanning parameters have been calculated, schedpaging triggers the page scanner through a condition variable wakeup.

The page scanner thread cycles through the physical page list, progressing by the number of pages requested each time it is woken up. The front hand and the back hand each have a page pointer. The front hand is incremented first so that it can clear the referenced and modified bits for the page currently pointed to by the front hand. The back hand is then incremented, and the status of the page pointed to by the back hand is checked by the check_page() function. At this point, if the page has been modified, it is placed in the dirty page queue for processing by the page-out thread. If the page was not referenced (it's clean!), then it is simply freed.

Dirty pages are placed onto a queue so that a separate thread, the page-out thread, can write them out to their backing store. We use another thread so that a deadlock can't occur while the system is waiting to swap a page out. The page-out thread uses a preinitialized list of async buffer headers as the queue for I/O requests. The list is initialized with 256 entries, which means the queue can contain at most 256 entries. The number of entries preconfigured on the list is controlled by the async_request_size system parameter. Requests to queue more I/Os onto the queue will be blocked if the entire queue is full (256 entries) or if the rate of pages queued has exceeded the system maximum set by the maxpgio parameter.
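The two limits on queueing (a full queue, or the maxpgio rate being exceeded) can be modeled with a small bounded queue. This is a simplified sketch: it returns False where the real scanner would block, it counts the rate over one-second windows (an assumption), and the class and method names are hypothetical.

```python
from collections import deque


class PageoutQueue:
    """Toy model of the fixed-size queue between scanner and page-out thread."""

    def __init__(self, async_request_size=256, maxpgio=40):
        self.capacity = async_request_size   # preallocated request entries
        self.maxpgio = maxpgio               # max pages queued per second
        self.pending = deque()
        self.queued_this_second = 0

    def try_queue(self, page):
        """Scanner side: queue a dirty page, or return False if it must wait."""
        if len(self.pending) >= self.capacity:
            return False                     # every entry is in use
        if self.queued_this_second >= self.maxpgio:
            return False                     # rate limit reached for this second
        self.pending.append(page)
        self.queued_this_second += 1
        return True

    def tick_second(self):
        """Start a new one-second accounting window."""
        self.queued_this_second = 0

    def push_one(self):
        """Page-out thread side: take the oldest request, or None if empty."""
        return self.pending.popleft() if self.pending else None
```

Keeping the queue a fixed size means the scanner never has to allocate memory to record a page-out request, which matters when the system is already short of memory.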

The page-out thread simply removes I/O entries from the queue and initiates I/O on each by calling the vnode putpage() function for the page in question. In the Solaris kernel, this function calls the swapfs_putpage() function to initiate the swap page-out via the swapfs layer. The swapfs layer delays and gathers together pages (16 pages on sun4u), then writes these out together. The klustsize parameter controls the number of pages that swapfs will cluster; the defaults are shown in Table 10.4. (See Section 9.8.)

Table 10.4. `swapfs` Cluster Sizes
Platform	Number of Clustered Pages (set by klustsize)
sun4u	16 (128k)
i86	14 (56k)

10.3.6. The Memory Scheduler

In addition to the page-out process, the CPU scheduler/dispatcher can swap out entire processes to conserve memory. This operation is separate from page-out. Swapping out a process involves removing all of a process's thread structures and private pages from memory, and setting flags in the process table to indicate that this process has been swapped out. This is an inexpensive way to conserve memory, but it dramatically affects a process's performance and hence is used only when paging fails to free enough memory consistently.

The memory scheduler is launched at boot time and does nothing unless the 30-second average of free memory is consistently below desfree. At this point, the memory scheduler starts looking for processes that it can completely swap out. The memory scheduler will soft-swap out processes if the shortage is minimal or hard-swap out processes in the case of a larger memory shortage.

10.3.6.1. Soft Swapping

Soft swapping takes place when the 30-second average for free memory is below desfree. Then, the memory scheduler looks for processes that have been inactive for at least maxslp seconds. When the memory scheduler finds a process that has been sleeping for maxslp seconds, it swaps out the thread structures for each thread, then pages out all of the private pages of memory for that process.

10.3.6.2. Hard Swapping

Hard swapping takes place when all of the following are true:

More than two processes are on the run queue, waiting for CPU.
The average free memory over 30 seconds is consistently less than desfree.
Excessive paging (determined to be true if page-out + page-in > maxpgio) is going on.

When hard swapping is invoked, a much more aggressive approach is used to find memory. First, the kernel is requested to unload all modules and cache memory that are not currently active, then processes are sequentially swapped out until the desired amount of free memory is returned. Parameters that affect the Memory Scheduler are shown in Table 10.5.

Table 10.5. Memory Scheduler Parameters
Parameter	Effect on Memory Scheduler
`desfree`	If the average amount of free memory falls below `desfree` for 30 seconds, then the memory scheduler is invoked.
`maxslp`	When soft-swapping, the memory scheduler starts swapping processes that have slept for at least `maxslp` seconds. The default for `maxslp` is 20 seconds and is tunable.
`maxpgio`	When the run queue is greater than 2, free memory is below `desfree`, and the paging rate is greater than `maxpgio`, then hard swapping occurs, unloading kernel modules and process memory.
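The soft-swap and hard-swap conditions can be summarized as a pair of predicates. This is a sketch of the decision rules as stated in this section, not the scheduler's code; the function names, argument names, and return values are hypothetical.

```python
def swap_mode(avg_free_30s, desfree, runq_len, pageout_rate, pagein_rate, maxpgio):
    """Return "none", "soft", or "hard" for the current memory state.

    avg_free_30s and desfree are in pages; the paging rates are per second.
    """
    if avg_free_30s >= desfree:
        return "none"                        # no sustained shortage
    excessive_paging = pageout_rate + pagein_rate > maxpgio
    if runq_len > 2 and excessive_paging:
        return "hard"                        # all three hard-swap conditions hold
    return "soft"


def soft_swap_candidates(sleep_seconds_by_pid, maxslp=20):
    """Processes that have slept for at least maxslp seconds."""
    return [pid for pid, slept in sleep_seconds_by_pid.items() if slept >= maxslp]


print(swap_mode(900, 1024, runq_len=1, pageout_rate=10, pagein_rate=5, maxpgio=40))
print(soft_swap_candidates({101: 5, 102: 20, 103: 60}))
```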