5.10. The Pager Interface

The pager interface provides the mechanism by which data are moved between backing store and physical memory. The FreeBSD pager interface is an evolution of the interface present in Mach 2.0 as evolved by 4.4BSD. The interface is page based, with all data requests made in multiples of the software page size. Vm_page structures are passed around as descriptors providing the backing-store offset and physical-memory address of the desired data. This interface should not be confused with the Mach 3.0 external paging interface [Young, 1989], where pagers are typically user applications outside the kernel and are invoked via asynchronous remote procedure calls using the Mach interprocess-communication mechanism. The FreeBSD interface is internal in the sense that the pagers are compiled into the kernel and pager routines are invoked via simple function calls.

Each virtual-memory object has a pager type, pager handle, and pager private data associated with it. Conceptually, the pager describes a logically contiguous piece of backing store, such as a chunk of swap space or a disk file. The pager type identifies the pager responsible for supplying the contents of pages within the object. Each pager registers a set of functions that define its operations. These function sets are stored in an array indexed by pager type. When the kernel needs to do a pager operation, it uses the pager type to index into the array of pager functions and then selects the routine that it needs, such as getting or putting pages. For example,

    (*pagertab[object->type]->pgo_putpages)(object, vmpage, count, flags, rtvals);

pushes count pages starting with page vmpage from object.
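
The operations vector named in this dispatch can be pictured as a structure of function pointers, one per pager routine. The following is a minimal sketch in the style of the kernel's struct pagerops; the argument lists are simplified here and should not be taken as the kernel's exact prototypes.

    /*
     * Simplified sketch of a pager operations vector; argument lists
     * are abbreviated and illustrative.
     */
    struct pagerops {
        void        (*pgo_init)(void);          /* one-time setup */
        vm_object_t (*pgo_alloc)(void *handle, vm_ooffset_t size,
                        vm_prot_t prot, vm_ooffset_t offset);
        void        (*pgo_dealloc)(vm_object_t object);
        int         (*pgo_getpages)(vm_object_t object, vm_page_t *m,
                        int count, int reqpage);
        void        (*pgo_putpages)(vm_object_t object, vm_page_t *m,
                        int count, int flags, int *rtvals);
        boolean_t   (*pgo_haspage)(vm_object_t object, vm_pindex_t pindex,
                        int *before, int *after);
        void        (*pgo_pageunswapped)(vm_page_t m);
    };

    /* One entry per pager type, indexed by object->type. */
    extern struct pagerops *pagertab[];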

A pager type is specified when an object is created to map a file, device, or piece of anonymous memory into a process address space. The pager manages the object throughout its lifetime. When a page fault occurs for a virtual address mapping a particular object, the fault-handling code allocates a vm_page structure and converts the faulting address to an offset within the object. This offset is recorded in the vm_page structure, and the page is added to the list of pages cached by the object. The page frame and object are then passed to the underlying pager routine. The pager routine is responsible for filling the vm_page structure with the appropriate initial value for that offset of the object that it represents.
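
For concreteness, the conversion from faulting address to object offset is simple arithmetic on the map entry that covers the address. The sketch below uses illustrative field names for the map entry rather than the kernel's exact layout.

    /*
     * Illustrative sketch: turn a faulting virtual address into an
     * offset within the backing object.  Here, entry->start is the
     * first address of the mapping and entry->offset is where the
     * mapping begins within the object (field names are stand-ins).
     */
    vm_ooffset_t
    fault_offset(vm_map_entry_t entry, vm_offset_t fault_addr)
    {
        return (entry->offset + (fault_addr - entry->start));
    }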

The pager is also responsible for saving the contents of a dirty page if the system decides to push it out to backing store. When the pageout daemon decides that a particular page is no longer needed, it requests that the object owning the page free it. The object first passes the page, with its associated logical offset, to the underlying pager to be saved for future use. The pager is responsible for finding an appropriate place to save the page, doing any I/O necessary for the save, and then notifying the object that the page is being freed. When it is done, the pager marks the page as clean so that the pageout daemon can move the vm_page structure to the cache or free list for future use.

There are seven routines associated with each pager type; see Table 5.1. The pgo_init() routine is called at boot time to do any one-time type-specific initializations, such as allocating a pool of private pager structures. The pgo_alloc() routine associates a pager with an object as part of the creation of the object. The pgo_dealloc() routine disassociates a pager from an object as part of the destruction of the object. Objects are created either by the mmap system call or internally during process creation by the fork or exec system calls.

Table 5.1. Operations defined by a pager.

    Operation              Description
    -------------------    -----------------------------------------------
    pgo_init()             initialize pager
    pgo_alloc()            allocate pager
    pgo_dealloc()          deallocate pager
    pgo_getpages()         read page(s) from backing store
    pgo_putpages()         write page(s) to backing store
    pgo_haspage()          check whether backing store has a page
    pgo_pageunswapped()    remove a page from backing store (swap pager only)

Pgo_getpages() is called to return one or more pages of data from a pager. The main use of this routine is by the page-fault handler. Pgo_putpages() writes back one or more pages of data. This routine is called by the pageout daemon to write back one or more pages asynchronously, and by msync to write back one or more pages synchronously or asynchronously. Both the get and put routines are called with an array of vm_page structures and a count of the affected pages.

The pgo_haspage() routine queries a pager to see whether it has data at a particular backing-store offset. This routine is used in the clustering code of the page-fault handler to determine whether pages on either side of a faulted page can be read in as part of a single I/O operation. It is also used when collapsing objects to determine if the allocated pages of a shadow object completely obscure the allocated pages of the object that it shadows.
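
A sketch of the clustering use follows: the fault handler asks the pager how many contiguous backing-store pages precede and follow the faulted page, then sizes a single read accordingly. The before and after output parameters follow the style of the interface; the surrounding code is illustrative, not the kernel's fault handler.

    /*
     * Illustrative use of pgo_haspage() to size a clustered pagein
     * around the faulted page at index pindex.
     */
    int before, after;

    if ((*pagertab[object->type]->pgo_haspage)(object, pindex,
        &before, &after)) {
        vm_pindex_t first = pindex - before;    /* start of cluster */
        int count = before + 1 + after;         /* pages in one I/O */
        /* ... issue a single read for pages [first, first + count) ... */
    }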

The four types of pagers supported by the system are described in the next four subsections.

Vnode Pager

The vnode pager handles objects that map files in a filesystem. Whenever a file is opened either explicitly by open or implicitly by exec, the system must find an existing vnode that represents it or, if there is no existing vnode for the file, allocate a new vnode for it. Part of allocating a new vnode is to allocate an object to hold the pages of the file and to associate the vnode pager with the object. The object handle is set to point to the vnode and the private data stores the size of the file. Any time the vnode changes size, the object is informed by a call to vnode_pager_setsize().
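
For example, a filesystem that grows a file during a write would notify the object with a call of the following form (a sketch; new_file_size is an assumed local variable, and locking and error handling are omitted):

    /* Tell the VM object about the file's new length so that pages
       beyond end-of-file are handled correctly. */
    vnode_pager_setsize(vp, (vm_ooffset_t)new_file_size);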

When a pagein request is received by the vnode-pager pgo_getpages() routine, it is passed an array of physical pages, the size of the array, and the index into the array of the page that is required. Only the required page must be read, but the pgo_getpages() routine is encouraged to provide as many of the others as it can easily read at the same time. For example, if the required page is in the middle of a file block, the filesystem will usually read the entire block, since the whole block can be read with a single I/O operation. The larger read fills in the required page along with the pages surrounding it. The I/O is done using a physical-I/O buffer that maps the pages to be read into the kernel address space long enough for the pager to call the device-driver strategy routine to load the pages with the file contents. Once the pages are filled, the kernel mapping can be dropped, the physical-I/O buffer can be released, and the pages can be returned.

When the vnode pager is asked to save a page to be freed, it simply arranges to write the page back to the part of the file from which the page came. The request is made with the pgo_putpages() routine, which is passed an array of physical pages, the size of the array, and the index into the array of the page that must be written. Only the required page must be written, but the pgo_putpages() routine is encouraged to write as many of the others as it can easily handle at the same time. The filesystem will write out all the pages that are in the same filesystem block as the required page. As with the pgo_getpages() routine, the pages are mapped into the kernel only long enough to do the write operation.

If a file is being privately mapped, then modified pages cannot be written back to the filesystem. Such a private mapping must use a shadow object with a swap pager for all pages that are modified. Thus, a privately mapped object will never be asked to save any dirty pages to the underlying file.

Historically, the BSD kernel had separate caches for the filesystems and the virtual memory. FreeBSD has eliminated the filesystem buffer cache by replacing it with the virtual-memory cache. Each vnode has an object associated with it, and the blocks of the file are stored in the pages associated with the object. The file data is accessed using the same pages whether they are mapped into an address space or accessed via read and write. An added benefit of this design is that the filesystem cache is no longer limited by the amount of kernel address space that can be dedicated to it. Absent other demands on system memory, nearly all of it can be dedicated to caching filesystem data.

Device Pager

The device pager handles objects representing memory-mapped hardware devices. Memory-mapped devices provide an interface that looks like a piece of memory. An example of a memory-mapped device is a frame buffer, which presents a range of memory addresses with one word per pixel on the screen. The kernel provides access to memory-mapped devices by mapping the device memory into a process's address space. The process can then access that memory without further operating-system intervention. Writing to a word of the frame-buffer memory causes the corresponding pixel to take on the appropriate color and brightness.

The device pager is fundamentally different from the other three pagers in that it does not fill provided physical-memory pages with data. Instead, it creates and manages its own vm_page structures, each of which describes a page of the device space. The head of the list of these pages is kept in the pager private-data area of the object. This approach makes device memory look like wired physical memory. Thus, no special code should be needed in the remainder of the virtual-memory system to handle device memory.

When a device is first mapped, the device-pager allocation routine will validate the desired range by calling the device d_mmap() routine. If the device allows the requested access for all pages in the range, an empty page list is created in the private-data area of the object that manages the device mapping. The device-pager allocation routine does not create vm_page structures immediately; they are created individually by the pgo_getpages() routine as they are referenced. The reason for this late allocation is that some devices export a large memory range in which either not all pages are valid or the pages may not be accessed for common operations. Complete allocation of vm_page structures for these sparsely accessed devices would be wasteful.

The first access to a device page will cause a page fault and will invoke the device-pager pgo_getpages() routine. The device pager creates a vm_page structure, initializes it with the appropriate object offset and a physical address returned by the device d_mmap() routine, and flags the page as fictitious. This vm_page structure is added to the list of all such allocated pages for the object. Since the fault code has no special knowledge of the device pager, it has preallocated a physical-memory page to fill and has associated that vm_page structure with the object. The device-pager routine removes that vm_page structure from the object, returns the structure to the free list, and inserts its own vm_page structure in the same place.
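
The following sketch illustrates this substitution. The helper names dev_mmap_paddr(), fake_page_alloc(), and page_replace() are hypothetical stand-ins, named here only to show the flow of the preallocated page being discarded in favor of a fictitious page describing device memory.

    /*
     * Hypothetical sketch of the device-pager pagein path.  All helper
     * functions here are illustrative stand-ins, not kernel routines.
     */
    static int
    dev_pager_getpage(vm_object_t object, vm_page_t *mres)
    {
        vm_page_t orig = *mres;     /* page preallocated by fault code */
        vm_page_t fake;
        vm_paddr_t paddr;

        /* Ask the driver which physical address backs this offset. */
        paddr = dev_mmap_paddr(object->handle, orig->pindex);

        /* Build a fictitious vm_page describing the device memory and
           record it on the object's private list of device pages. */
        fake = fake_page_alloc(object, orig->pindex, paddr);

        /* Replace the preallocated page, which goes back to the free
           list, with the fictitious page. */
        page_replace(object, orig, fake);
        *mres = fake;
        return (0);                 /* illustrative success value */
    }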

The device-pager pgo_putpages() routine expects never to be called and will panic if it is. This behavior is based on the assumption that device-pager pages are never entered into any of the paging queues and hence will never be seen by the pageout daemon. However, it is possible to msync a range of device memory. This operation brings up an exception to the higher-level virtual-memory system's ignorance of device memory: The object page-cleaning routine will skip pages that are flagged as fictitious.

Finally, when a device is unmapped, the device-pager deallocation routine is invoked. This routine deallocates all the vm_page structures that it allocated.

Physical-Memory Pager

The physical-memory pager handles objects that contain nonpageable memory. Their only use is for the System V shared-memory interface, which can be configured to use nonpageable memory instead of the default swappable memory.

The first access to a physical-memory-pager page will cause a page fault and will invoke the pgo_getpages() routine. Like the swap pager, the physical-memory pager zero fills pages when they are first faulted. Unlike the swap pager's pages, however, these pages are marked as unmanaged so that they will not be considered for replacement by the pageout daemon. Marking its pages unmanaged makes the memory for the physical-memory pager look like wired physical memory. Thus, no special code is needed in the remainder of the virtual-memory system to handle physical-memory-pager memory.

The pgo_putpages() routine of the physical-memory pager does not expect to be called, and it panics if it is. This behavior is based on the assumption that physical-memory-pager pages are never entered into any of the paging queues and hence will never be seen by the pageout daemon. However, it is possible to msync a range of physical-memory-pager memory. This operation brings up an exception to the higher-level virtual-memory system's ignorance of physical-memory-pager memory: The object page-cleaning routine will skip pages that are flagged as unmanaged.

Finally, when an object using the physical-memory pager is freed, all its pages have their unmanaged flag cleared and are released back to the list of free pages.

Swap Pager

The term swap pager refers to two functionally different pagers. In the most common use, swap pager refers to the pager used by objects that map anonymous memory. This pager has sometimes been referred to as the default pager because it is the pager that is used if no other pager has been requested. It provides what is commonly known as swap space: nonpersistent backing store that is zero filled on first reference. When an anonymous object is first created, it is assigned the default pager. The default pager allocates no resources and provides no storage backing. It handles page faults (pgo_getpages()) by zero filling pages and handles page queries (pgo_haspage()) by reporting that no pages are held. The expectation is that free memory will be plentiful enough that it will not be necessary to swap out any pages. The object will simply create zero-filled pages during the process lifetime that can all be returned to the free list when the process exits. When an object with the default pager is freed, no pager cleanup is required, since no pager resources were allocated.
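
A conceptual sketch of these two operations follows, based on the text's description; the function names and bodies here are illustrative rather than the kernel's default-pager code.

    /*
     * Conceptual sketch of default-pager behavior: faults are handled
     * by zero filling, and queries report that nothing is held.
     */
    static int
    default_pager_getpages(vm_object_t object, vm_page_t *m, int count,
        int reqpage)
    {
        pmap_zero_page(m[reqpage]);             /* zero fill on first use */
        m[reqpage]->valid = VM_PAGE_BITS_ALL;   /* mark contents valid */
        return (VM_PAGER_OK);
    }

    static boolean_t
    default_pager_haspage(vm_object_t object, vm_pindex_t pindex,
        int *before, int *after)
    {
        return (FALSE);         /* the default pager holds no pages */
    }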

However, on the first request by the pageout daemon to remove an active page from an anonymous object, the default pager replaces itself with the swap pager. The role of the swap pager is swap-space management: figuring out where to store dirty pages and how to find dirty pages when they are needed again. Shadow objects require that these operations be efficient. A typical shadow object is sparsely populated: It may cover a large range of pages, but only those pages that have been modified will be in the shadow object's backing store. In addition, long chains of shadow objects may require numerous pager queries to locate the correct copy of an object page to satisfy a page fault. Hence, determining whether a pager contains a particular page needs to be fast, preferably requiring no I/O operations. A final requirement of the swap pager is that it can do asynchronous writeback of dirty pages. This requirement is necessitated by the pageout daemon, which is a single-threaded process. If the pageout daemon blocked waiting for a page-clean operation to complete before starting the next operation, it would be unlikely to keep enough memory free in times of heavy memory demand.

In theory, any pager that meets these criteria can be used as the swap pager. In Mach 2.0, the vnode pager was used as the swap pager. Special paging files could be created in any filesystem and registered with the kernel. The swap pager would then suballocate pieces of the files to back particular anonymous objects. One obvious advantage of using the vnode pager is that swap space can be expanded dynamically by the addition of more swap files or the extension of existing ones (i.e., without rebooting or reconfiguring the kernel). The main disadvantage is that the filesystem does not provide as much bandwidth as direct access to the disk.

The desire to provide the highest possible disk bandwidth led to the creation of a special raw-partition pager to use as the swap pager for FreeBSD. Previous versions of BSD also used dedicated disk partitions, commonly known as swap partitions, so this partition pager became the swap pager. The remainder of this section describes how the swap pager is implemented and how it provides the necessary capabilities for backing anonymous objects.

In 4.4BSD, the swap pager preallocated a fixed-sized structure to describe the object. For a large object the structure would be large even if only a few pages of the object were pushed to backing store. Worse, the size of the object was frozen at the time of allocation. Thus, if the anonymous area continued to grow (such as the stack or heap of a process), a new object had to be created to describe the expanded area. On a system that was short of memory, the result was that a large process could acquire many anonymous objects. Changing the swap pager to handle growing objects dramatically reduced this object proliferation. Other problems with the 4.4BSD swap pager were that its simplistic management of the swap space led to fragmentation, slow allocation under load, and deadlocks brought on by its need to allocate kernel memory during periods of shortage. For all these reasons, the swap pager was completely rewritten in FreeBSD 4.0.

Swap space tends to be sparsely allocated. On average, a process accesses only about half of its allocated address space during its lifetime, so only about half the pages in an object ever come into existence. Unless the machine is under heavy memory pressure and the process is long lived, most of the pages that do come into existence will never be pushed to backing store. So the new swap pager replaced the old fixed-size block map for each object with a method that allocates a structure for each set of swap blocks that gets allocated. Each structure can track a set of up to 32 contiguous swap blocks. A large object with two pages swapped out will use at most two of these structures, and only one if the two swapped pages are close to each other (as they often are). The amount of memory required to track swap space for an object is proportional to the number of pages that have been pushed to swap rather than to the size of the object. The size of the object is no longer frozen when its first page is swapped out, since additional structures can be allocated to track pages in any newly grown area.
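
A sketch of such a tracking structure is shown below, following the description above: each node covers up to 32 consecutive page indices, recording a swap-block address for each index that has been pushed out. The field names and the SWAPBLK_NONE sentinel are illustrative, written in the spirit of the kernel's structure rather than as a copy of it.

    #define SWAP_META_PAGES 32              /* indices per structure */
    #define SWAPBLK_NONE    (~(daddr_t)0)   /* no swap block assigned */

    /*
     * Illustrative per-object swap-tracking structure.  One of these
     * exists only for a range that contains at least one swapped page.
     */
    struct swblock {
        struct swblock *swb_hnext;      /* global hash-chain link */
        vm_object_t     swb_object;     /* object owning this range */
        vm_pindex_t     swb_index;      /* first page index covered */
        int             swb_count;      /* valid entries in swb_pages */
        daddr_t         swb_pages[SWAP_META_PAGES]; /* swap addresses */
    };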

The structures that track swap space usage are kept in a global hash table managed by the swap pager. While it might seem logical to store the structures separately on lists associated with the object of which they are a part, the single global hash table has two important advantages:

  1. It ensures a short time to determine whether a page of an object has been pushed to swap (a lookup sketch follows this list). If the structures were linked onto a list headed by the object, then objects with many swapped pages would require the traversal of a long list. The long list could be shortened by creating a hash table for every object, but that would require much more memory than simply allocating a single large hash table that could be used by all objects.

  2. It allows operations that need to scan all the allocated swap blocks to have a centralized place to find them rather than needing to scan all the anonymous objects in the system. An example is the swapoff system call that removes a swap partition from use. It needs to page in all the blocks from the device that is to be taken out of service.
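
The lookup promised in point 1 can be sketched as follows: a hash is computed from the (object, page index) pair, and the short chain in the selected bucket is searched. The swhash table, SWHASH_MASK, and the hash mixing are all assumed declarations for illustration.

    /*
     * Illustrative lookup in the global swap hash table; the hash
     * function is only an example.
     */
    static struct swblock *
    swp_hash_lookup(vm_object_t object, vm_pindex_t pindex)
    {
        struct swblock *swb;
        u_int bucket;

        /* Round down to the start of the 32-page range. */
        pindex &= ~(vm_pindex_t)(SWAP_META_PAGES - 1);
        bucket = (((uintptr_t)object >> 8) + pindex) & SWHASH_MASK;
        for (swb = swhash[bucket]; swb != NULL; swb = swb->swb_hnext)
            if (swb->swb_object == object && swb->swb_index == pindex)
                return (swb);
        return (NULL);      /* no page in this range is on swap */
    }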

The free space in the swap area is managed with a bitmap that has one bit for each page-sized block of swap space. The bitmap for the entire swap area is allocated when the swap space is first added to the system. This initial allocation avoids the need to allocate kernel memory during critical low-memory swapping operations. Although the bitmap initially requires more space than the old block list did when first allocated, its size does not change with use, whereas the block list could grow larger than the bitmap as the swap area became fragmented. The system tends to swap when it is low on memory, and to avoid potential deadlocks, kernel memory should not be allocated at such times.

Doing a linear scan of the swap-block bitmaps to find free space would be unacceptably slow. Thus, the bitmap is organized in a radix-tree structure with free-space hinting in the radix-node structures. The use of radix-tree structures makes swap-space allocation and release a constant-time operation. To reduce fragmentation, the radix tree can allocate large contiguous chunks at once, skipping over smaller fragmented chunks.
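
The hinting idea can be sketched briefly: each radix node caches the size of the largest free run anywhere in its subtree, so an allocation descends only into subtrees that can satisfy the request and never scans fragmented regions bit by bit. This sketch is conceptual, not the kernel's allocator; the radix of 16 and the field names are illustrative.

    /* Conceptual sketch of free-space hinting in a radix node. */
    struct radix_node {
        int                rn_bighint;      /* largest free run below */
        struct radix_node *rn_child[16];    /* subtrees, each covering
                                               a contiguous swap range */
    };

    /* An allocation of npages skips any subtree whose hint is too
       small, keeping the search time independent of fragmentation. */
    static int
    subtree_can_satisfy(struct radix_node *rn, int npages)
    {
        return (rn->rn_bighint >= npages);
    }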

A future improvement would be to keep track of the different-sized free areas as swap allocations are done, similarly to the way that the filesystem tracks the different sizes of free space. This free-space information would increase the probability of doing contiguous allocation and improve locality of reference.

Swap blocks are allocated at the time that swap out is done. They are freed when the page is brought back in and becomes dirty or the object is freed.

The swap pager is responsible for managing the I/O associated with the pgo_putpages() request. Once it identifies the set of pages within the pgo_putpages() request that it will be able to write, it must allocate a buffer and have those pages mapped into it. Because the swap pager does not synchronously wait while the I/O is done, it does not regain control after the I/O operation completes. Therefore, it marks the buffer with a callback flag and sets the routine for the callback to be swp_pager_async_iodone().
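
The setup can be sketched as follows. The b_iodone completion hook and the asynchronous flag follow the style of the buffer interface; the surrounding buffer preparation is abbreviated and illustrative.

    /*
     * Illustrative launch of an asynchronous pageout.  Buffer setup
     * (mapping the pages, setting the block number) is omitted.
     */
    bp->b_iodone = swp_pager_async_iodone;  /* completion callback */
    bp->b_flags |= B_ASYNC;                 /* do not wait for the I/O */
    VOP_STRATEGY(bp->b_vp, bp);             /* queue the write; control
                                               returns immediately */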

When the push completes, swp_pager_async_iodone() is called. Each written page is marked as clean and has its busy bit cleared; the vm_page_io_finish() routine is called on each page to notify the pageout daemon that the write has completed and to awaken any processes waiting for it. The swap pager then unmaps the pages from the buffer and releases it. A count of pageouts-in-progress is kept for the pager associated with each object; this count is decremented when the pageout completes, and, if the count goes to zero, a wakeup() is issued. This operation is done so that an object that is deallocating a swap pager can wait for the completion of all pageout operations before freeing the pager's references to the associated swap space.

Because the number of swap buffers is constant, the swap pager must take care to ensure that it does not use more than its fair share. Once this limit is reached, the pgo_putpages() operation blocks until one of the swap pager's outstanding writes completes. This unexpected blocking of the pageout daemon is an unfortunate side effect of pushing the buffer management down into the pagers. Any single pager hitting its buffer limit stops the pageout daemon. While the pageout daemon might want to do additional I/O operations using other I/O resources such as the network, it is prevented from doing so. Worse, the failure of any single pager can deadlock the system by preventing the pageout daemon from running.

